They analyze text. But by Mehrabian's much-cited estimate, 93% of emotional information lives in voice, face, and behavior. Synstate builds the perception layer that's missing.
Unlike discrete emotion recognition, which remains scientifically contested, these physiological markers have strong empirical validation for detecting underlying states.
Information Redundancy Theory: emotions manifest simultaneously across face, voice, and behavior, so losing one modality does not mean losing all of the information. Combined accuracy degrades gracefully as streams drop out.
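A minimal fusion sketch in PyTorch, assuming every encoder emits an embedding of the same width; the function name and dimension are illustrative, not Synstate's API. Whatever modalities are present get pooled, so a dropped stream degrades the estimate instead of destroying it:

```python
# Illustrative only: pooling whichever modality embeddings survived.
import torch

EMB_DIM = 256  # assumed shared embedding width

def fuse_available(embeddings: dict[str, torch.Tensor]) -> torch.Tensor:
    """Average the embeddings of the modalities that are present.

    Redundancy across face, voice, and behavior means the fused
    state stays usable when a stream drops out.
    """
    if not embeddings:
        raise ValueError("at least one modality must be present")
    return torch.stack(list(embeddings.values())).mean(dim=0)

# e.g. the microphone failed mid-session: fuse face + behavior only
state = fuse_available({
    "face": torch.randn(EMB_DIM),
    "behavior": torch.randn(EMB_DIM),
})
```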
How separate modality encoders converge through cross-modal attention into a unified state vector.
Each stream runs through a specialized encoder that produces dense embeddings.
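For illustration, a per-stream encoder could look like the following sketch; the layer choices and feature widths are assumptions made for the example, not the production design:

```python
# Hypothetical stream encoder: projects raw frames, then models time.
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """Turn one modality's (batch, time, features) frames into dense embeddings."""

    def __init__(self, in_features: int, emb_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(in_features, emb_dim)
        self.temporal = nn.GRU(emb_dim, emb_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.temporal(self.proj(frames))
        return hidden  # (batch, time, emb_dim)

# Example instantiations; the input widths are made up for the sketch.
prosody_encoder = StreamEncoder(in_features=40)   # e.g. 40 mel bands
face_encoder = StreamEncoder(in_features=136)     # e.g. 68 landmarks, x and y
```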
A 6-layer crossmodal transformer (the MulT architecture): queries come from modality A, keys and values from modality B, so each stream learns which signals in the others matter.
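A rough single layer in that style (MulT stacks six per directed modality pair); sizes and names here are illustrative:

```python
# Sketch of one crossmodal attention layer in the MulT spirit.
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    def __init__(self, emb_dim: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(emb_dim)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Queries come from modality A; keys and values from modality B,
        # so A attends to the moments in B that carry signal.
        attended, _ = self.attn(query=a, key=b, value=b)
        a = self.norm1(a + attended)
        return self.norm2(a + self.ff(a))
```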
All modalities are aligned in a single geometric space (the ImageBind approach), so the same emotional content clusters together regardless of its source modality.
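The alignment objective can be sketched as a symmetric contrastive loss in the CLIP/ImageBind spirit, pulling paired clips from two modalities together; the temperature value is just a common default:

```python
# Sketch of a symmetric InfoNCE alignment loss over paired embeddings.
import torch
import torch.nn.functional as F

def alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(z_a.size(0))         # the i-th pair is the positive
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```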
Continuous dimensional scores (valence, arousal, dominance) with uncertainty bounds. Not discrete labels, so nuance is preserved.
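One plausible shape for that output head, assuming a Gaussian mean and log-variance per dimension; the class and field names are ours, not Synstate's:

```python
# Hypothetical dimensional head: bounded scores plus per-dimension uncertainty.
import torch
import torch.nn as nn

class DimensionalHead(nn.Module):
    DIMS = ("valence", "arousal", "dominance")

    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.out = nn.Linear(emb_dim, 2 * len(self.DIMS))  # mean and logvar

    def forward(self, state: torch.Tensor) -> dict:
        mean, logvar = self.out(state).chunk(2, dim=-1)
        mean = torch.tanh(mean)           # keep scores in [-1, 1]
        std = torch.exp(0.5 * logvar)     # uncertainty bound per dimension
        return {dim: (mean[..., i], std[..., i])
                for i, dim in enumerate(self.DIMS)}
```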
Dual-memory system: short-term for the current session, long-term for personal patterns compressed across weeks.
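A toy version of that split, loosely in the spirit of MovieChat's dense-to-sparse memory; the class and capacity are invented for the sketch:

```python
# Toy dual memory: a bounded session buffer plus a compressed archive.
from collections import deque
import torch

class DualMemory:
    def __init__(self, short_capacity: int = 512):
        self.short_term = deque(maxlen=short_capacity)  # current session
        self.long_term: list[torch.Tensor] = []         # weeks of patterns

    def observe(self, state_vec: torch.Tensor) -> None:
        self.short_term.append(state_vec)

    def consolidate(self) -> None:
        """Compress the session into one centroid and archive it."""
        if self.short_term:
            session = torch.stack(list(self.short_term))
            self.long_term.append(session.mean(dim=0))
            self.short_term.clear()
```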
Three-layer architecture: start with universal emotion recognition, adapt to cultural norms, then personalize to unique expressions.
Cross-cultural emotion recognition covering 6 basic emotions, facial action units, and prosody patterns across diverse populations.
Adapts to cultural display rules, context-dependent emotion norms, and regional prosody variations.
Adapts to YOUR unique baseline over 7-14 days: your “neutral” face, your voice-pitch calibration, your personal stress-behavior correlations.
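One way the personalization layer could be implemented is with low-rank adapters (LoRA, cited below): the universal model stays frozen and each user trains a tiny delta during the calibration window. Rank and scaling here are illustrative:

```python
# Hypothetical per-user adapter: a frozen shared layer plus a LoRA delta.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # universal weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The shared prediction plus a small user-specific correction.
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale
```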
Multimodal Machine Learning: A Survey and Taxonomy
Multimodal Transformer for Unaligned Multimodal Language Sequences (MulT)
ImageBind: One Embedding Space To Bind Them All
Are Multimodal Transformers Robust to Missing Modality?
SMIL: Multimodal Learning with Severely Missing Modality
Distributed Representations of Words and Phrases (Word2Vec)
Attention Is All You Need
Learning Transferable Visual Models From Natural Language Supervision (CLIP)
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
LoRA: Low-Rank Adaptation of Large Language Models
Emotional Expressions Reconsidered
A Circumplex Model of Affect
Culture and Emotion: Integration of Biological and Cultural Contributions
An Argument for Basic Emotions
Cues to Deception
Honest engineering means being clear about boundaries. Barrett's critique is right that discrete emotion labels from faces alone are unreliable. Here's how we address it.
Accuracy drops 12-15% for underrepresented groups. Personal calibration requires 7-14 days. On-device deployment means smaller models, roughly 3% less accurate than the cloud versions.
Not “you are ANGRY” but valence: -0.6, arousal: +0.8, dominance: +0.3. Continuous scores preserve nuance.
Classification under the EU AI Act: “Limited Risk” system, subject to the transparency obligations for emotion recognition systems (Article 52 of the 2021 proposal; Article 50 in the final Act).