AI doesn't see when humans struggle

AI systems analyze text. But 93% of emotional information lives in voice, face, and behavior. Synstate builds the perception layer that's missing.

🔴 Stress: Physiological tension
😴 Drowsiness: Alertness decline
🧠 Cognitive Load: Mental exhaustion
Fatigue: Accumulated tiredness

Objective signals, clinical evidence

Unlike emotion recognition, which remains scientifically contested, these physiological markers have strong empirical validation for detecting the states above.

👁️ Visual Signals
  • PERCLOS (eye closure) ~95%
  • Blink frequency ~90%
  • Gaze dispersion ~85%
  • Head pose stability ~88%

🎤 Acoustic Signals
  • Speech rate variation ~82%
  • Pitch jitter ~78%
  • Pause patterns ~80%
  • Voice tremor ~75%

⌨️ Behavioral Signals
  • Keystroke dynamics ~86%
  • Error rate deviation ~84%
  • Mouse entropy ~79%
  • Task switching ~81%
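
As a toy illustration of how one behavioral signal could be derived, here is a minimal sketch (the `keystroke_deviation` helper is hypothetical, not Synstate's feature pipeline) that scores keystroke timing against a personal baseline:

```python
import statistics

def keystroke_deviation(timestamps_s, baseline_mean_s, baseline_std_s):
    """Z-score of the current mean inter-key interval against the user's
    personal baseline; positive values mean slower-than-usual typing."""
    intervals = [b - a for a, b in zip(timestamps_s, timestamps_s[1:])]
    return (statistics.mean(intervals) - baseline_mean_s) / baseline_std_s

# Keystrokes at 0.00s, 0.21s, 0.45s, 0.72s vs. a 0.18s ± 0.04s baseline.
print(keystroke_deviation([0.00, 0.21, 0.45, 0.72], 0.18, 0.04))  # ≈ 1.5
```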

Why it works with missing data

Information Redundancy Theory: emotions manifest simultaneously across face, voice, and behavior. Losing one modality ≠ losing all information.

Interactive modality demo

Toggle modalities on and off to see how combined accuracy changes.

Face · Voice · Behavior
Combined accuracy: 85% (clinical grade)
Sources: Wu et al. (2024); Ma et al., CVPR 2022; SMIL, AAAI 2021
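
A minimal sketch of the training trick behind this robustness, assuming a PyTorch late-fusion setup (the `modality_dropout` helper and dimensions are illustrative, not Synstate's API): randomly masking whole modalities during training teaches the fusion model to keep working when a stream goes missing.

```python
import torch

def modality_dropout(embeddings: dict, p_drop: float = 0.3) -> dict:
    """Randomly zero out entire modalities during training so the fusion
    model learns to rely on whichever streams survive.
    `embeddings` maps modality name -> (batch, dim) tensor."""
    kept = {}
    for name, emb in embeddings.items():
        # Drop the whole modality for a sample with probability p_drop.
        mask = (torch.rand(emb.shape[0], 1) > p_drop).float()
        kept[name] = emb * mask
    return kept

# Toy illustration: face + voice + behavior embeddings for a batch of 4.
batch = {m: torch.randn(4, 256) for m in ("face", "voice", "behavior")}
fused = torch.cat(list(modality_dropout(batch).values()), dim=-1)
print(fused.shape)  # torch.Size([4, 768])
```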

Three streams, one meaning

How separate modality encoders converge through cross-modal attention into a unified state vector.

1. Modality encoders

Each stream runs through a specialized encoder that produces dense embeddings.

ViT-B/16 · Wav2Vec2-Large · BERT-base
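
A rough sketch of this step using off-the-shelf Hugging Face checkpoints that match the names above (the checkpoint IDs and pooling choices are assumptions, not Synstate's exact stack):

```python
import torch
from transformers import BertModel, BertTokenizer, ViTModel, Wav2Vec2Model

vision = ViTModel.from_pretrained("google/vit-base-patch16-224")   # ViT-B/16
audio = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large")   # Wav2Vec2-Large
text = BertModel.from_pretrained("bert-base-uncased")              # BERT-base
tok = BertTokenizer.from_pretrained("bert-base-uncased")

with torch.no_grad():
    # One 224x224 face crop, one second of 16 kHz audio, one short text snippet.
    face_emb = vision(torch.randn(1, 3, 224, 224)).last_hidden_state.mean(dim=1)  # (1, 768)
    voice_emb = audio(torch.randn(1, 16000)).last_hidden_state.mean(dim=1)        # (1, 1024)
    inputs = tok("typing slowed down after the meeting", return_tensors="pt")
    text_emb = text(**inputs).last_hidden_state[:, 0]                             # (1, 768), [CLS]
```
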
2. Cross-modal attention

A 6-layer crossmodal transformer (MulT architecture): queries come from modality A, keys and values come from modality B, so the model learns which signals matter across streams.

MulT · Crossmodal Q/K/V
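
A minimal sketch of one such crossmodal layer (width, head count, and the `CrossModalBlock` name are placeholders, not the production configuration); stacking six per direction gives the 6-layer transformer described above.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One MulT-style crossmodal layer: modality A attends to modality B.
    Queries come from A; keys and values come from B."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a: (batch, len_a, dim) target stream; b: (batch, len_b, dim) source stream
        attended, _ = self.attn(self.norm_a(a), self.norm_b(b), self.norm_b(b))
        a = a + attended
        return a + self.ff(a)

# Voice features (queries) enriched by face features (keys/values).
voice, face = torch.randn(2, 50, 256), torch.randn(2, 30, 256)
enriched_voice = CrossModalBlock()(voice, face)  # (2, 50, 256)
```
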
3. Shared embedding space

All modalities are aligned in the same geometric space (ImageBind approach), so the same emotional content clusters together regardless of source modality.

ImageBind · Contrastive learning
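
A minimal sketch of the contrastive objective behind this alignment (symmetric InfoNCE as used by CLIP and ImageBind; batch size, dimensions, and temperature here are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(face_emb, voice_emb, temperature: float = 0.07):
    """Symmetric InfoNCE: embeddings of the same moment from two modalities
    are pulled together, mismatched pairs pushed apart."""
    face = F.normalize(face_emb, dim=-1)
    voice = F.normalize(voice_emb, dim=-1)
    logits = face @ voice.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(face.shape[0])     # i-th face clip matches i-th voice clip
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```
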
4. State vector output

Continuous dimensional scores (valence, arousal, dominance) with uncertainty bounds rather than discrete labels, preserving nuance.

3D continuous · ±5-8% CI
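
A sketch of what such an output head can look like, assuming a Gaussian-style uncertainty estimate (the `StateHead` module and its dimensions are illustrative, not the shipped model):

```python
import torch
import torch.nn as nn

class StateHead(nn.Module):
    """Maps the fused representation to continuous valence/arousal/dominance
    scores plus a per-dimension uncertainty estimate (no discrete labels)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mean = nn.Linear(dim, 3)      # valence, arousal, dominance in [-1, 1]
        self.log_var = nn.Linear(dim, 3)   # learned per-dimension uncertainty

    def forward(self, fused: torch.Tensor):
        mean = torch.tanh(self.mean(fused))
        std = torch.exp(0.5 * self.log_var(fused))
        return {"vad": mean, "uncertainty": std}

out = StateHead()(torch.randn(1, 256))
# e.g. vad ≈ [-0.6, +0.8, +0.3] with per-dimension confidence bounds
```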

How it remembers you

Dual-memory system: short-term for the current session, long-term for personal patterns compressed across weeks.


Short-term memory

  • Stores: Last 5 min
  • Size: ~15,000 tokens
  • Purpose: Immediate context

Long-term memory

  • Stores: Patterns over weeks
  • Size: ~2,000 tokens
  • Purpose: Personal baseline

Memory compression (7 days → 2,000 tokens)

  • Raw: ~1.2M tokens
  • Compressed: ~2K tokens (600× reduction)

Uses temporal attention pooling (MovieChat architecture)
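
A minimal sketch of temporal attention pooling in that spirit (the learned query count, widths, and toy buffer size are assumptions, not the production configuration): a small set of learned memory queries attends over the long history and keeps a fixed-size summary.

```python
import torch
import torch.nn as nn

class MemoryCompressor(nn.Module):
    """Temporal attention pooling: learned memory queries summarize an
    arbitrarily long history into a fixed number of memory tokens."""
    def __init__(self, dim: int = 256, memory_tokens: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(memory_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, seq_len, dim) -- e.g. days of accumulated state vectors
        q = self.queries.unsqueeze(0).expand(history.shape[0], -1, -1)
        compressed, _ = self.attn(q, history, history)
        return compressed  # (batch, memory_tokens, dim), independent of seq_len

week = torch.randn(1, 20000, 256)          # long raw buffer (illustrative size)
summary = MemoryCompressor(256)(week)      # (1, 64, 256) fixed-size long-term memory
```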

Universal → Cultural → Individual

Three-layer architecture: start with universal emotion recognition, adapt to cultural norms, then personalize to unique expressions.

Layer 1: Universal base model

Cross-cultural emotion recognition covering 6 basic emotions, facial action units, and prosody patterns across diverse populations.

78% baseline
Layer 2: Cultural adapter

Adapts to cultural display rules, context-dependent emotion norms, and regional prosody variations.

+5% → 83%
Layer 3: Personal LoRA

Adapts to your unique baseline over 7-14 days: your “neutral” face, voice pitch calibration, and personal stress-behavior correlations.

LoRA adapters +2% → 85%
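
A compact sketch of what a LoRA adapter does mechanically (following Hu et al., ICLR 2022; the rank, scaling, and `LoRALinear` wrapper here are illustrative, not Synstate's settings): the universal weights stay frozen and only a small low-rank update is learned per user.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen base projection with a low-rank personal update
    (W + B @ A), so per-user adaptation touches only a few thousand weights."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # universal weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Personalizing one projection of the shared model with a user-specific adapter.
personal = LoRALinear(nn.Linear(256, 256))
out = personal(torch.randn(1, 256))
```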

Baseline calibration timeline

  • Day 1: Population average
  • Day 4: Partially personalized
  • Day 7: Mostly calibrated
  • Day 14: Fully personal

15 papers that made this possible

  • Baltrusaitis, Ahuja & Morency. Multimodal Machine Learning: A Survey and Taxonomy. IEEE TPAMI, 2019.
  • Tsai, Bai, Liang et al. Multimodal Transformer for Unaligned Multimodal Language Sequences (MulT). ACL, 2019.
  • Girdhar, El-Nouby, Liu et al. ImageBind: One Embedding Space To Bind Them All. CVPR, 2023.
  • Ma, Ren, Zhao et al. Are Multimodal Transformers Robust to Missing Modality? CVPR, 2022.
  • Ma, Ren, Zhao, Testuggine & Peng. SMIL: Multimodal Learning with Severely Missing Modality. AAAI, 2021.
  • Mikolov, Sutskever, Chen, Corrado & Dean. Distributed Representations of Words and Phrases and their Compositionality (Word2Vec). NeurIPS, 2013.
  • Vaswani, Shazeer, Parmar et al. Attention Is All You Need. NeurIPS, 2017.
  • Radford, Kim, Hallacy et al. Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML, 2021.
  • Song, Soleymani, Morency et al. MovieChat: From Dense Token to Sparse Memory for Long Video Understanding. CVPR, 2023.
  • Hu, Shen, Wallis et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022.
  • Barrett, Adolphs, Marsella, Martinez & Pollak. Emotional Expressions Reconsidered. Psychological Science in the Public Interest, 2019.
  • Russell. A Circumplex Model of Affect. Journal of Personality and Social Psychology, 1980.
  • Matsumoto & Hwang. Culture and Emotion: Integration of Biological and Cultural Contributions. Journal of Cross-Cultural Psychology, 2012.
  • Ekman. An Argument for Basic Emotions. Cognition & Emotion, 1992.
  • DePaulo, Lindsay, Malone et al. Cues to Deception. Psychological Bulletin, 2003.

What we can't do (yet)

Honest engineering means being clear about boundaries. Barrett's critique is right that discrete emotion labels from faces alone are unreliable. Here's how we address it.

What we don't claim

  • × “We read your exact emotions”
  • × “Facial expressions = universal truth”
  • × “Works perfectly for everyone”
  • × Reliable deception detection (<60% accuracy)

Known limitations

  • Accuracy drops 12-15% for underrepresented groups
  • Requires 7-14 days for personal calibration
  • On-device processing means smaller models (~3% less accurate than cloud)

What we do

  • Detect statistical patterns in multimodal signals
  • Compare to your personal baseline, not universal templates
  • Report dimensional scores (valence, arousal), not discrete labels
  • Show uncertainty bounds (±5-8% confidence intervals)

Dimensional model

Not “you are ANGRY” but valence: -0.6, arousal: +0.8, dominance: +0.3. Continuous scores preserve nuance.

EU AI Act compliance

Classification: “Limited Risk” system (Article 52)

  • Users informed AI is analyzing signals
  • Opt-in consent required
  • All processing on-device (GDPR Art. 25)
  • Export / delete personal data anytime
  • NOT used for employment decisions
  • NOT used for law enforcement
  • NOT used for social scoring