Introducing Inter-1

The model that reads what words can't say

Inter-1 is an omni-modal model purpose-built for understanding human social signals. It doesn't just look at what people say — it sees how they say it, what their body communicates, and what all of it actually means.


3 modalities: video + audio + text

12 social signals: grounded in behavioral science

1 rationale: for every detection

The problem

AI still mostly listens to what you say

Someone says "I'm fine." Arms crossed. Gaze averted. Voice flat. The words say one thing. Everything else says another.

Humans read this instantly. AI doesn't. Most AI models, including the best frontier models, are language-first. They analyze transcripts, understand speech, process images. But they miss most of what human communication actually is.

Decades of behavioral science show that communication is inherently multimodal. Gesture, posture, gaze, timing, vocal prosody: they all shape how words land. Especially when they reinforce or contradict the verbal message.

Current models read the transcript. They miss the rest.

The model

Built to close the gap

Inter-1 is an omni-modal model that processes video, audio, and text together, in temporal alignment. It detects and explains 12 social signals grounded in behavioral science and social psychology.

Omni-modal perception

Video, audio, and text processed in temporal alignment. Not stitched together after the fact — synchronized from the start. Gaze, posture, prosody, and words analyzed as one stream.

12 social signals

Not emotion labels. Observable behavioral signals derived from research on how humans communicate intent, engagement, affect, and relational dynamics through verbal and nonverbal channels.

Explainable by default

Every detection includes a rationale — which cues were observed, across which modalities, and how they map to the detected signal. Auditable, actionable, and grounded in our behavioral science taxonomy.

Taxonomy

12 social signals — not emotion labels

The 6- or 8-category emotion frameworks that dominate affective computing were designed for discrete, lab-elicited expressions. They don't capture how people communicate in interviews, negotiations, presentations, or clinical conversations.

Inter-1 operates on a different taxonomy: 12 social signals derived from research on how humans communicate intent, engagement, affect, and relational dynamics through verbal and nonverbal channels.

Agreement

Nodding, verbal affirmation, postural alignment — signs of explicit accord.

Confidence

Steady gaze, upright posture, deliberate speech — projecting certainty.

Confusion

Furrowed brow, head tilt, verbal fillers — processing difficulty is showing.

Disagreement

Head shake, crossed arms, contradictory statements — overt opposition.

Disengagement

Gaze drift, postural withdrawal, reduced responsiveness — attention has left.

Engagement

Forward lean, active gaze, responsive gestures — fully present and oriented.

Frustration

Sharp tone, tense posture, audible sighs — mounting irritation is surfacing.

Hesitation

Extended pauses, tentative phrasing, gaze breaks — weighing before committing.

Interest

Raised eyebrows, leaning in, follow-up questions — curiosity is pulling them forward.

Skepticism

Narrowed eyes, asymmetric smile, qualifying language — not fully buying it.

Stress

Vocal tremor, self-touch gestures, rapid blinking — internal pressure leaking out.

Uncertainty

Hedging language, shifting posture, incomplete sentences — still searching for a position.

Signals, not feelings. Emotions are internal states. Social signals are the communicative layer — the outward behaviors that externalize those states in ways other people can interpret. A person doesn't broadcast “I am experiencing the emotion of anger.” They furrow their brow, raise their voice, lean forward. And the same cue — a pause before answering — can mean hesitation, careful thought, or discomfort depending on context.

Explainability

Beyond labels: rationale as output

Most models produce a label and a confidence score. Inter-1 produces a rationale — a structured explanation of which behavioral cues it observed, which modalities those cues came from, and how they map to the detected signal.

Example detection

Hesitation · 22.0s to 25.8s · probability: medium

Rationale: The speaker pauses for 3.2 seconds before responding. Tentative phrasing follows (“I think maybe we could…”) with rising intonation. Gaze breaks to upper-left during the pause. Right hand touches chin briefly (self-adapter). These cues converge across vocal, visual, and verbal channels, consistent with weighing options before committing.
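To make the shape of such a detection concrete, here is a minimal Python sketch of how a client might represent it. The class and field names (`Detection`, `Cue`, `start_s`, `cues`, and so on) are hypothetical illustrations, not the Inter-1 API schema; only the values are taken from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Cue:
    modality: str     # "vocal", "visual", or "verbal" (channel names from the example)
    description: str  # what was observed


@dataclass
class Detection:
    # Hypothetical record layout; the real API response may differ.
    signal: str       # one of the 12 social signals
    start_s: float    # start of the detection window, in seconds
    end_s: float      # end of the detection window, in seconds
    probability: str  # categorical level, e.g. "medium"
    cues: list = field(default_factory=list)

    @property
    def duration_s(self) -> float:
        # Rounded to avoid floating-point noise such as 3.8000000000000007.
        return round(self.end_s - self.start_s, 3)

    def modalities(self) -> set:
        # Distinct channels that contributed evidence.
        return {cue.modality for cue in self.cues}


hesitation = Detection(
    signal="Hesitation",
    start_s=22.0,
    end_s=25.8,
    probability="medium",
    cues=[
        Cue("vocal", "3.2 s pause before responding"),
        Cue("verbal", "tentative phrasing: 'I think maybe we could...'"),
        Cue("vocal", "rising intonation"),
        Cue("visual", "gaze breaks to upper-left during the pause"),
        Cue("visual", "right hand touches chin briefly (self-adapter)"),
    ],
)

print(hesitation.duration_s, sorted(hesitation.modalities()))
```

Keeping cues as separate items, each tagged with its modality, is what lets a reviewer check the claim that evidence converges across channels rather than taking the label on trust.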

Auditable

Check the model's reasoning against specific cues and timestamps. Point to evidence instead of arguing over a label.

Actionable

Understand what the model actually observed. Use the rationale to get additional context about the conversation or forward it to an LLM.

Calibrated

An estimated probability for every detection gives you a simple read on confidence. Decide when a case needs closer review.
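As a rough illustration of the last two points, the sketch below routes detections to human review by probability level and formats a rationale as context for a downstream LLM. It assumes three ordered levels ("low", "medium", "high"); only "medium" appears in the example above, so the other two are assumptions, and the function names are hypothetical helpers, not part of any Inter-1 SDK.

```python
# Assumed ordering of probability levels; only "medium" is shown in the source.
LEVELS = {"low": 0, "medium": 1, "high": 2}


def needs_review(probability, auto_accept_at="high"):
    """Return True when a detection falls below the auto-accept level."""
    if probability not in LEVELS:
        raise ValueError(f"unknown probability level: {probability!r}")
    return LEVELS[probability] < LEVELS[auto_accept_at]


def build_llm_context(signal, start_s, end_s, rationale, question):
    """Format one detection as plain-text context for a downstream LLM."""
    return (
        f"Detected signal: {signal} ({start_s:.1f}s to {end_s:.1f}s)\n"
        f"Rationale: {rationale}\n\n"
        f"Question: {question}"
    )


# Toy detections for illustration only: (signal, probability level).
detections = [("Hesitation", "medium"), ("Engagement", "high"), ("Stress", "low")]
review_queue = [signal for signal, level in detections if needs_review(level)]

prompt = build_llm_context(
    "Hesitation",
    22.0,
    25.8,
    "The speaker pauses for 3.2 seconds before responding.",
    "What might the speaker be unsure about?",
)
print(review_queue)
print(prompt)
```

Where to set the auto-accept level depends on the cost of a wrong call in your application; a clinical or hiring context would likely send more cases to review than a presentation-coaching tool.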

We're building a model that interprets human behavior, and we believe that carries responsibility. Any model that reads human behavior should be able to explain itself.

Where it looks

Inter-1 pays attention to what matters

When we evaluated existing models — including multimodal frontier models — the most consistent weakness was that their outputs are dominated by verbal content.

Show them a video and ask them to analyze engagement: they describe what was said. They might note someone is “on screen” or “speaking.” But they won't tell you the speaker broke eye contact, shifted their posture, and paused mid-sentence.

Inter-1 is trained to treat nonverbal cues as core evidence.

👁

Gaze direction

Eye contact, aversion, tracking patterns

🗣

Vocal prosody

Pitch, rhythm, pace, emphasis, pauses

🤸

Postural shifts

Lean direction, orientation, openness

🤌

Hand gestures

Illustrators, adaptors, emphasis markers

😐

Micro-expressions

Fleeting facial movements, action units

⏱

Speech timing

Response latency, pause duration, turn-taking

Benchmarks

Outperforms across all 12 signals

We evaluated Inter-1 against ~15 carefully selected models across the frontier, large, medium, and small tiers — including the best available commercial APIs and leading open-weight models.

[Chart: Social Signal Detection, Macro F1. Higher is better. Single metric across all 12 signals.]

[Chart: Accuracy vs Speed. Upper-right is better. Log-scale inference speed.]

[Chart: Per-Signal Accuracy, Selected Signals. Higher is better. Four representative social signals.]

Models shown: Inter-1, Gemini 3.1 Pro, GPT-5.4, Grok 4.1

* Single, consistently applied evaluation metric across all 12 signals. One number, reproducibly measured.
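For readers unfamiliar with the metric, macro F1 computes an F1 score per signal and then takes the unweighted mean, so rare signals count as much as common ones. The sketch below is a plain-Python illustration of that definition, assuming each clip can carry several signals at once; it is not the evaluation harness behind these benchmarks, and the toy labels are made up.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when a signal has no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(true_sets, pred_sets, labels):
    """Unweighted mean of per-label F1 over multi-label predictions."""
    scores = []
    for label in labels:
        pairs = list(zip(true_sets, pred_sets))
        tp = sum(label in t and label in p for t, p in pairs)
        fp = sum(label not in t and label in p for t, p in pairs)
        fn = sum(label in t and label not in p for t, p in pairs)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


# Made-up toy data: one set of signals per clip.
labels = ["Hesitation", "Stress"]
truth = [{"Hesitation"}, {"Stress"}, {"Hesitation", "Stress"}, set()]
preds = [{"Hesitation"}, set(), {"Hesitation"}, {"Stress"}]

score = macro_f1(truth, preds, labels)
print(score)  # Hesitation F1 = 1.0, Stress F1 = 0.0, so the macro average is 0.5
```

Note how the perfect score on one signal cannot hide the complete miss on the other, which is the reason to prefer a macro average when signal frequencies are uneven.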

** Models tested using their best available configuration. No cherry-picked prompts for competitors.

Expert evaluation

Preferred 83% of the time by behavioral scientists

A benchmark score tells you how well a model performs against a reference label. It doesn't tell you whether the reasoning is trustworthy. So we ran a blind A/B evaluation with experts who have backgrounds in behavioral science and clinical psychology.

Measures reported: overall preference, evidential grounding, clarity & specificity.

Side by side

Speaker responding to a difficult question

Inter-1

The speaker's voice is sharp and slightly louder than a normal conversational tone. He uses a sharp, pointed gesture with his index finger while speaking, which conveys a sense of annoyance. Right shoulder rises slightly — a protective micro-posture that often co-occurs with frustration.

Baseline model

Lips press tightly together before speaking, and the jaw appears set; voice increases slightly in volume on an emphasized word. Eyebrows draw inward briefly.

Speaker searching for words mid-explanation

Inter-1

The speaker pauses for a significant amount of time before continuing his sentence, and uses filler words like "uh" and "odd" to bridge the gap. He briefly looks away from the camera while searching for his words. Speech rhythm slows noticeably.

Baseline model

Eyes partially closed with reduced blinking, gaze directed slightly downward. Head and upper body remain still with minimal visible gestures. Overall low movement and sustained posture with limited expressivity.

What's next

Inter-1 is version one

The 12 signals we detect today are the start. The infrastructure we've built — the dataset, the taxonomy, the annotation pipeline, the evaluation framework — is designed to scale.

Expanded signal taxonomy

Beyond the initial 12 to include culturally variable signals and context-specific behavioral patterns.

Real-time streaming inference

Getting Inter-1 fast enough for live conversation analysis. As soon as possible.

Multi-person interaction

Currently optimized for single-speaker-in-frame. Multi-person scenes are on the roadmap.

Baseline-aware detection

Adapting to individual behavioral patterns rather than relying only on population-level norms.

On-device inference

For privacy-sensitive use cases where sending video to an API is unacceptable.

Purpose-built dataset expansion

Continuously growing our dataset with more demographics, contexts, and interaction types.

Inter-1 is available now via the Interhuman AI API. Join our developer community to stay updated on what's coming.


Start building with Inter-1 today

Join developers who are adding social intelligence to their products. Free tier available.



About the model

How is Inter-1 different from GPT-4o or Gemini on video?

What are "social signals" versus emotions?

What does the rationale contain?

Is Inter-1 accurate across demographics?

Can Inter-1 handle multiple people in frame?

Is there real-time support?

How was Inter-1 evaluated?
