Inter-1 is an omni-modal model purpose-built for understanding human social signals. It doesn't just look at what people say — it sees how they say it, what their body communicates, and what all of it actually means.
Someone says "I'm fine." Arms crossed. Gaze averted. Voice flat. The words say one thing. Everything else says another.
Humans read this instantly. AI doesn't. Most AI models, including the best frontier models, are language-first. They analyze transcripts, understand speech, process images. But they miss most of what human communication actually is.
Decades of behavioral science show that communication is inherently multimodal. Gesture, posture, gaze, timing, vocal prosody: they all shape how words land, especially when they reinforce or contradict the verbal message.
Current models read the transcript. They miss the rest.
Inter-1 is an omni-modal model that processes video, audio, and text together, in temporal alignment. It detects and explains 12 social signals grounded in behavioral science and social psychology.
Video, audio, and text processed in temporal alignment. Not stitched together after the fact — synchronized from the start. Gaze, posture, prosody, and words analyzed as one stream.
Not emotion labels. Observable behavioral signals derived from research on how humans communicate intent, engagement, affect, and relational dynamics through verbal and nonverbal channels.
Every detection includes a rationale — which cues were observed, across which modalities, and how they map to the detected signal. Auditable, actionable, and grounded in our behavioral science taxonomy.
The 6- or 8-category emotion frameworks that dominate affective computing were designed for discrete, lab-elicited expressions. They don't capture how people communicate in interviews, negotiations, presentations, or clinical conversations.
Inter-1 operates on a different taxonomy: 12 social signals derived from research on how humans communicate intent, engagement, affect, and relational dynamics through verbal and nonverbal channels.
Nodding, verbal affirmation, postural alignment — signs of explicit accord.
Steady gaze, upright posture, deliberate speech — projecting certainty.
Furrowed brow, head tilt, verbal fillers — processing difficulty is showing.
Head shake, crossed arms, contradictory statements — overt opposition.
Gaze drift, postural withdrawal, reduced responsiveness — attention has left.
Forward lean, active gaze, responsive gestures — fully present and oriented.
Sharp tone, tense posture, audible sighs — mounting irritation is surfacing.
Extended pauses, tentative phrasing, gaze breaks — weighing before committing.
Raised eyebrows, leaning in, follow-up questions — curiosity is pulling them forward.
Narrowed eyes, asymmetric smile, qualifying language — not fully buying it.
Vocal tremor, self-touch gestures, rapid blinking — internal pressure leaking out.
Hedging language, shifting posture, incomplete sentences — still searching for a position.
Signals, not feelings. Emotions are internal states. Social signals are the communicative layer — the outward behaviors that externalize those states in ways other people can interpret. A person doesn't broadcast “I am experiencing the emotion of anger.” They furrow their brow, raise their voice, lean forward. And the same cue — a pause before answering — can mean hesitation, careful thought, or discomfort depending on context.
Most models produce a label and a confidence score. Inter-1 produces a rationale — a structured explanation of which behavioral cues it observed, which modalities those cues came from, and how they map to the detected signal.
Rationale: The speaker pauses for 3.2 seconds before responding. Tentative phrasing follows (“I think maybe we could…”) with rising intonation. Gaze breaks to upper-left during the pause. Right hand touches chin briefly (self-adapter). These cues converge across vocal, visual, and verbal channels, consistent with weighing options before committing.
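To make the shape of such a rationale concrete, here is a minimal Python sketch of how a client might represent one detection and its supporting cues. The class names, field names (`Cue`, `Detection`, `modality`, `duration_s`), the modality labels, and the signal name `"hesitation"` are all assumptions for illustration, not the documented Inter-1 response schema. The cue descriptions and the 3.2-second pause are taken from the example rationale.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Cue:
    """One observed behavioral cue (hypothetical structure)."""
    modality: str                       # assumed labels: "vocal", "visual", "verbal"
    description: str                    # what was observed
    duration_s: Optional[float] = None  # set only when the rationale states a duration


@dataclass
class Detection:
    """A detected signal plus the cues that support it (hypothetical structure)."""
    signal: str
    cues: List[Cue] = field(default_factory=list)

    def modalities(self) -> Set[str]:
        """Return the set of channels that contributed evidence."""
        return {cue.modality for cue in self.cues}


# The example rationale, expressed as structured cues.
example = Detection(
    signal="hesitation",  # assumed name; the rationale describes weighing before committing
    cues=[
        Cue("vocal", "pause before responding", duration_s=3.2),
        Cue("verbal", "tentative phrasing: 'I think maybe we could...'"),
        Cue("vocal", "rising intonation"),
        Cue("visual", "gaze breaks to upper-left during the pause"),
        Cue("visual", "right hand touches chin briefly (self-adapter)"),
    ],
)

# Which channels does the evidence span?
print(sorted(example.modalities()))
```

Keeping cues as separate records, each tagged with its modality, is what lets a reviewer check whether a detection rests on one channel or on converging evidence from several.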
Check the model's reasoning against specific cues and timestamps. Point to evidence instead of arguing over a label.
Understand what the model actually observed. Use the rationale to get additional context about the conversation or forward it to an LLM.
An estimated probability for every detection gives you a simple read on confidence. Decide when a case needs closer review.
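One way to act on the estimated probability is a simple threshold that routes low-confidence detections to a human reviewer. The sketch below assumes each detection is a dictionary with `signal` and `probability` keys. Those key names, the signal names, the 0.7 cutoff, and the sample probabilities are illustrative assumptions, not values from Inter-1.

```python
from typing import Dict, List, Tuple

REVIEW_THRESHOLD = 0.7  # hypothetical cutoff; tune it for your own use case

DetectionDict = Dict[str, object]


def split_for_review(
    detections: List[DetectionDict],
    threshold: float = REVIEW_THRESHOLD,
) -> Tuple[List[DetectionDict], List[DetectionDict]]:
    """Split detections into (accepted, needs_review) by estimated probability."""
    accepted: List[DetectionDict] = []
    needs_review: List[DetectionDict] = []
    for detection in detections:
        probability = float(detection["probability"])
        if probability >= threshold:
            accepted.append(detection)
        else:
            needs_review.append(detection)
    return accepted, needs_review


# Invented sample data, for illustration only.
sample = [
    {"signal": "hesitation", "probability": 0.91},
    {"signal": "skepticism", "probability": 0.55},
]

accepted, needs_review = split_for_review(sample)
print([d["signal"] for d in accepted])
print([d["signal"] for d in needs_review])
```

A single global threshold is the simplest policy. Per-signal thresholds may fit better when some signals carry higher stakes than others.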
We're building a model that interprets human behavior, and we believe that carries responsibility. Any model that reads human behavior should be able to explain itself.
When we evaluated existing models — including multimodal frontier models — the most consistent weakness was that their outputs are dominated by verbal content.
Show them a video and ask them to analyze engagement: they describe what was said. They might note someone is “on screen” or “speaking.” But they won't tell you the speaker broke eye contact, shifted their posture, and paused mid-sentence.
Inter-1 is trained to treat nonverbal cues as core evidence.
Eye contact, aversion, tracking patterns
Pitch, rhythm, pace, emphasis, pauses
Lean direction, orientation, openness
Illustrators, adaptors, emphasis markers
Fleeting facial movements, action units
Response latency, pause duration, turn-taking
We evaluated Inter-1 against ~15 carefully selected models across the frontier, large, medium, and small tiers — including the best available commercial APIs and leading open-weight models.
Higher is better. Single metric across all 12 signals.
Upper-right is better. Log-scale inference speed.
Higher is better. Four representative social signals.
* Single, consistently applied evaluation metric across all 12 signals. One number, reproducibly measured.
** Models tested using their best available configuration. No cherry-picked prompts for competitors.
A benchmark score tells you how well a model performs against a reference label. It doesn't tell you whether the reasoning is trustworthy. So we ran a blind A/B evaluation with experts who have backgrounds in behavioral science and clinical psychology.
The speaker's voice is sharp and slightly louder than a normal conversational tone. He uses a sharp, pointed gesture with his index finger while speaking, which conveys a sense of annoyance. Right shoulder rises slightly — a protective micro-posture that often co-occurs with frustration.
Lips press tightly together before speaking, and the jaw appears set; voice increases slightly in volume on an emphasized word. Eyebrows draw inward briefly.
The speaker pauses for a significant amount of time before continuing his sentence, and uses filler words like "uh" and "odd" to bridge the gap. He briefly looks away from the camera while searching for his words. Speech rhythm slows noticeably.
Eyes partially closed with reduced blinking, gaze directed slightly downward. Head and upper body remain still with minimal visible gestures. Overall low movement and sustained posture with limited expressivity.
The 12 signals we detect today are the start. The infrastructure we've built — the dataset, the taxonomy, the annotation pipeline, the evaluation framework — is designed to scale.
Beyond the initial 12 to include culturally variable signals and context-specific behavioral patterns.
Getting Inter-1 fast enough for live conversation analysis. As soon as possible.
Currently optimized for single-speaker-in-frame. Multi-person scenes are on the roadmap.
Adapting to individual behavioral patterns rather than relying only on population-level norms.
For privacy-sensitive use cases where sending video to an API is unacceptable.
Continuously growing our dataset with more demographics, contexts, and interaction types.
Inter-1 is available now via the Interhuman AI API. Join our developer community to stay updated on what's coming.
Join developers who are adding social intelligence to their products. Free tier available.