The model that reads what words can't say

The problem

AI still mostly listens to what you say

Someone says "I'm fine." Arms crossed. Gaze averted. Voice flat. The words say one thing. Everything else says another.

Humans read this instantly. AI doesn't. Most AI models, including the best frontier models, are language-first. They analyze transcripts, understand speech, process images. But they miss most of what human communication actually is.

Decades of behavioral science show that communication is inherently multimodal. Gesture, posture, gaze, timing, vocal prosody: they all shape how words land. Especially when they reinforce or contradict the verbal message.

Current models read the transcript. They miss the rest.

The model

Built to close the gap

Inter-1 is an omni-modal model that processes video, audio, and text together, in temporal alignment. It detects and explains 12 social signals grounded in behavioral science and social psychology.

Omni-modal perception

Video, audio, and text processed in temporal alignment. Not stitched together after the fact — synchronized from the start. Gaze, posture, prosody, and words analyzed as one stream.

12 social signals

Not emotion labels. Observable behavioral signals derived from research on how humans communicate intent, engagement, affect, and relational dynamics through verbal and nonverbal channels.

Explainable by default

Every detection includes a rationale — which cues were observed, across which modalities, and how they map to the detected signal. Auditable, actionable, and grounded in our behavioral science taxonomy.

Taxonomy

12 social signals — not emotion labels

The 6- or 8-category emotion frameworks that dominate affective computing were designed for discrete, lab-elicited expressions. They don't capture how people communicate in interviews, negotiations, presentations, or clinical conversations.

Inter-1 operates on a different taxonomy: 12 social signals derived from research on how humans communicate intent, engagement, affect, and relational dynamics through verbal and nonverbal channels.

Signals, not feelings. Emotions are internal states. Social signals are the communicative layer — the outward behaviors that externalize those states in ways other people can interpret. A person doesn't broadcast “I am experiencing the emotion of anger.” They furrow their brow, raise their voice, lean forward. And the same cue — a pause before answering — can mean hesitation, careful thought, or discomfort depending on context.

Explainability

Beyond labels: rationale as output

Most models produce a label and a confidence score. Inter-1 produces a rationale — a structured explanation of which behavioral cues it observed, which modalities those cues came from, and how they map to the detected signal.

Example detection
Hesitation · 22.0s to 25.8s · probability: medium

Rationale: The speaker pauses for 3.2 seconds before responding. Tentative phrasing follows (“I think maybe we could…”) with rising intonation. Gaze breaks to upper-left during the pause. Right hand touches chin briefly (self-adapter). These cues converge across vocal, visual, and verbal channels, consistent with weighing options before committing.
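
To make the shape of such a detection concrete, here is a minimal sketch of how the example above could be held in code. The class names, field names, and modality tags are assumptions made for illustration; they are not the actual response format of the Interhuman AI API.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for illustration only; not the real API format.

@dataclass
class Cue:
    modality: str      # assumed tags: "vocal", "visual", or "verbal"
    description: str   # what was observed

@dataclass
class Detection:
    signal: str
    start_s: float
    end_s: float
    probability: str   # categorical level, e.g. "medium"
    cues: List[Cue] = field(default_factory=list)

    def duration_s(self) -> float:
        # Rounded to avoid floating-point noise such as 3.8000000000000007.
        return round(self.end_s - self.start_s, 3)

    def modalities(self) -> List[str]:
        # Distinct modalities that contributed evidence, sorted by name.
        return sorted({cue.modality for cue in self.cues})

# The "Hesitation" example from the text, re-expressed with this schema.
hesitation = Detection(
    signal="Hesitation",
    start_s=22.0,
    end_s=25.8,
    probability="medium",
    cues=[
        Cue("vocal", "3.2-second pause before responding"),
        Cue("verbal", "tentative phrasing: 'I think maybe we could...'"),
        Cue("vocal", "rising intonation"),
        Cue("visual", "gaze breaks to upper-left during the pause"),
        Cue("visual", "right hand touches chin briefly (self-adapter)"),
    ],
)

print(hesitation.duration_s(), hesitation.modalities())
```

Keeping each cue tagged with its modality is what lets a reviewer check whether the evidence really converges across channels or rests on a single one.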

01

Auditable

Check the model's reasoning against specific cues and timestamps. Point to evidence instead of arguing over a label.

02

Actionable

Understand what the model actually observed. Use the rationale to get additional context about the conversation or forward it to an LLM.

03

Calibrated

An estimated probability for every detection gives you a simple read on confidence. Decide when a case needs closer review.
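
As a rough sketch of how a probability level could drive review decisions, the snippet below flags every detection that falls below a chosen auto-accept level. Only "medium" appears in the example detection; the "low" and "high" levels, their ordering, and the dictionary layout are assumptions for illustration.

```python
from typing import Dict, List

# Assumed ordering of probability levels (only "medium" is shown in the source).
LEVELS: Dict[str, int] = {"low": 0, "medium": 1, "high": 2}

def needs_review(detections: List[Dict[str, str]],
                 auto_accept_at: str = "high") -> List[Dict[str, str]]:
    """Return detections whose probability is below the auto-accept level."""
    threshold = LEVELS[auto_accept_at]
    return [d for d in detections if LEVELS[d["probability"]] < threshold]

# Toy detections, invented for this example.
detections = [
    {"signal": "Hesitation", "probability": "medium"},
    {"signal": "Interest", "probability": "high"},
    {"signal": "Stress", "probability": "low"},
]

flagged = needs_review(detections)
print([d["signal"] for d in flagged])
```

Lowering `auto_accept_at` to "medium" would send only the low-probability detections to a human reviewer.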

We're building a model that interprets human behavior, and we believe that carries responsibility. Any model that reads human behavior should be able to explain itself.

Where it looks

Inter-1 pays attention to what matters

When we evaluated existing models — including multimodal frontier models — the most consistent weakness was that their outputs are dominated by verbal content.

Show them a video and ask them to analyze engagement: they describe what was said. They might note someone is “on screen” or “speaking.” But they won't tell you the speaker broke eye contact, shifted their posture, and paused mid-sentence.

Inter-1 is trained to treat nonverbal cues as core evidence.

Benchmarks

Outperforms across all 12 signals

We evaluated Inter-1 against ~15 carefully selected models across the frontier, large, medium, and small tiers — including the best available commercial APIs and leading open-weight models.

Social Signal Detection — Macro F1

Higher is better. Single metric across all 12 signals.

Inter-1: 40.9%
Gemini 3.1 Pro: 36.6%
Gemini 2.5 Pro: 36.3%
Gemini 2.5 Flash Lite: 27.9%
Qwen3-Omni-Flash: 27.7%
Mistral Large 3: 21.2%
Qwen3.6-Plus: 21.1%
Kimi-K2.5: 13.8%
GPT-5.4: 13.6%
GLM-4.6V-Flash: 13.5%
Grok 4.1: 12.3%
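
For readers unfamiliar with the metric: macro F1 computes an F1 score per signal and averages them with equal weight, so a rare signal counts as much as a common one. The sketch below assumes one reference label and one predicted label per clip, which is a simplification; the page does not describe the exact evaluation protocol, and the labels and predictions here are toy data.

```python
from typing import Sequence

def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when the signal never occurs or is predicted."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom

def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

# Toy data, invented for this example.
labels = ["Interest", "Skepticism", "Stress"]
y_true = ["Interest", "Interest", "Skepticism", "Stress"]
y_pred = ["Interest", "Skepticism", "Skepticism", "Stress"]

score = macro_f1(y_true, y_pred, labels)
print(round(score, 4))
```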

Accuracy vs Speed

Upper-right is better. Log-scale inference speed.

Per-Signal Accuracy — Selected Signals

Higher is better. Four representative social signals.

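Interest: Inter-1 41.5%, Gemini 3.1 Pro 36.1%, GPT-5.4 22.1%, Grok 4.1 16.3%
Skepticism: Inter-1 54.7%, Gemini 3.1 Pro 49.3%, GPT-5.4 30.3%, Grok 4.1 17.4%
Stress: Inter-1 59.6%, Gemini 3.1 Pro 34.7%, GPT-5.4 22.7%, Grok 4.1 17.5%
Uncertainty: Inter-1 48.7%, Gemini 3.1 Pro 41.3%, GPT-5.4 9.1%, Grok 4.1 11.5%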

* Single, consistently applied evaluation metric across all 12 signals. One number, reproducibly measured.

** Models tested using their best available configuration. No cherry-picked prompts for competitors.

Expert evaluation

Preferred 83% of the time by behavioral scientists

A benchmark score tells you how well a model performs against a reference label. It doesn't tell you whether the reasoning is trustworthy. So we ran a blind A/B evaluation with experts who have backgrounds in behavioral science and clinical psychology.

Dimensions rated: overall preference, evidential grounding, clarity & specificity.
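
One way a blind A/B comparison like this can be tallied is sketched below: each trial shows two outputs on randomly assigned sides, the rater picks a side, and the choice is mapped back to the system afterwards. The function names, the system names used as keys, and the four toy trials are assumptions for illustration; the actual study protocol is not described on this page.

```python
import random
from typing import Dict, List, Tuple

def make_blind_pair(outputs: Dict[str, str],
                    rng: random.Random) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Assign two systems' outputs to sides "A" and "B" in random order.

    Returns (what the rater sees, hidden key mapping side -> system name).
    """
    systems = list(outputs)
    rng.shuffle(systems)
    shown = {"A": outputs[systems[0]], "B": outputs[systems[1]]}
    key = {"A": systems[0], "B": systems[1]}
    return shown, key

def preference_rate(choices: List[str], keys: List[Dict[str, str]],
                    system: str) -> float:
    """Fraction of trials where the chosen side maps back to `system`."""
    wins = sum(key[choice] == system for choice, key in zip(choices, keys))
    return wins / len(choices)

def side_of(key: Dict[str, str], system: str) -> str:
    """Look up which side a system was shown on in one trial."""
    return next(side for side, name in key.items() if name == system)

# Toy trials, invented for this example: the rater prefers "inter-1" in 3 of 4.
rng = random.Random(0)
outputs = {"inter-1": "rationale text 1", "baseline": "rationale text 2"}
keys = [make_blind_pair(outputs, rng)[1] for _ in range(4)]
choices = [side_of(k, "inter-1") for k in keys[:3]] + [side_of(keys[3], "baseline")]

rate = preference_rate(choices, keys, "inter-1")
print(rate)
```

Randomizing the side per trial keeps raters from learning which position belongs to which system.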

Side by side

Speaker responding to a difficult question
Inter-1

The speaker's voice is sharp and slightly louder than a normal conversational tone. He uses a sharp, pointed gesture with his index finger while speaking, which conveys a sense of annoyance. Right shoulder rises slightly — a protective micro-posture that often co-occurs with frustration.

Baseline model

Lips press tightly together before speaking, and the jaw appears set; voice increases slightly in volume on an emphasized word. Eyebrows draw inward briefly.

Speaker searching for words mid-explanation
Inter-1

The speaker pauses for a significant amount of time before continuing his sentence, and uses filler words like "uh" and "odd" to bridge the gap. He briefly looks away from the camera while searching for his words. Speech rhythm slows noticeably.

Baseline model

Eyes partially closed with reduced blinking, gaze directed slightly downward. Head and upper body remain still with minimal visible gestures. Overall low movement and sustained posture with limited expressivity.

What's next

Inter-1 is version one

The 12 signals we detect today are the start. The infrastructure we've built — the dataset, the taxonomy, the annotation pipeline, the evaluation framework — is designed to scale.

01

Expanded signal taxonomy

Beyond the initial 12 to include culturally variable signals and context-specific behavioral patterns.

02

Real-time streaming inference

Getting Inter-1 fast enough for live conversation analysis. As soon as possible.

03

Multi-person interaction

Currently optimized for single-speaker-in-frame. Multi-person scenes are on the roadmap.

04

Baseline-aware detection

Adapting to individual behavioral patterns rather than relying only on population-level norms.

05

On-device inference

For privacy-sensitive use cases where sending video to an API is unacceptable.

06

Purpose-built dataset expansion

Continuously growing our dataset with more demographics, contexts, and interaction types.

Inter-1 is available now via the Interhuman AI API. Join our developer community to stay updated on what's coming.

Start building with Inter-1 today

Join developers who are adding social intelligence to their products. Free tier available.
