Send a video, get structured JSON — 12 social signals with explanations, engagement level, and conversation quality metrics. One endpoint.
{
  "type": "Frustration",
  "start": 134.0,
  "end": 139.1,
  "probability": "high",
  "rationale": "The speaker's voice becomes sharper and louder as he speaks, and his tone is firm and accusatory. He uses direct, confrontational language such as 'You have no idea what you've done,' indicating annoyance. His facial expression appears tense and his brows are furrowed."
}
Every response contains all three layers: social signals, an engagement level, and actionable conversation quality scores — from a single endpoint.
Each signal includes a probability level and a human-readable rationale explaining what triggered it.
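A minimal sketch of what a call could look like from Python, assuming a requests-style upload. The endpoint URL, the INTER_API_KEY variable, and every response field name below are illustrative placeholders, not the documented API surface:

import os
import requests

# Hypothetical endpoint and bearer-token auth; consult the API reference
# for the real URL, field names, and token format.
API_URL = "https://api.example.com/v1/analyze"
headers = {"Authorization": f"Bearer {os.environ['INTER_API_KEY']}"}

with open("sales_call.mp4", "rb") as video:
    response = requests.post(API_URL, headers=headers, files={"video": video})
response.raise_for_status()
result = response.json()

# All three layers arrive in the one response (field names assumed):
for signal in result["signals"]:
    print(f"{signal['type']} [{signal['start']}-{signal['end']}s] "
          f"({signal['probability']}): {signal['rationale']}")
print("Engagement:", result["engagement"]["level"])
print("CQI:", result["cqi"]["score"])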
Alignment with another person's position, intent, or understanding
How firmly and assuredly someone communicates their position or decision
A breakdown or gap in understanding during an interaction
Active divergence from another person's viewpoint or proposal
Reduction in attention, involvement, or investment in the interaction
Sustained focus and active participation in the interaction
Mounting tension or irritation when progress toward a goal feels blocked
Uncertainty or delay before committing to a response or action
Attention or curiosity toward something unexpected or stimulating
Questioning or doubtful stance toward a claim, proposal, or explanation
Heightened tension or unease during an interaction
A temporary disruption in speaking or responding to a question
Agreement: Alignment with another person's position, intent, or understanding
{ "type": "Agreement", "start": 1.2, "end": 8.7, "probability": "high", "rationale": "The speaker explicitly states, "Yeah, that was great," providing clear verbal confirmation of his stance. He also nods his head while speaking and maintains steady eye contact with the camera, demonstrating active participation and alignment with the topic." }
Detecting buy-in during sales calls, confirming understanding in training, measuring alignment in negotiations
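As a sketch of the sales-call use case, the snippet below filters the response for high-probability Agreement spans. It reuses the result object and the assumed field names from the request sketch above; none of these names come from the documented API:

# Collect high-probability Agreement spans as candidate buy-in moments.
# "signals", "type", and "probability" are assumed field names, not
# confirmed API fields.
buy_in_moments = [
    (s["start"], s["end"], s["rationale"])
    for s in result["signals"]
    if s["type"] == "Agreement" and s["probability"] == "high"
]
for start, end, why in buy_in_moments:
    print(f"Possible buy-in at {start:.1f}-{end:.1f}s: {why}")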
The API computes a Conversation Quality Index (CQI) from the detected signals: a single 0–100 score, broken down into five behavioural dimensions and provided as both a snapshot and a rolling timeline that updates throughout the conversation.
Each dimension measures the balance between supportive and undermining behaviours detected in the conversation. Higher is always better.
Clarity: How easy the speaker is to follow, reflecting the organization, concision, and coherence of their ideas.
Authority: How confident, decisive, and credible the speaker comes across in their delivery.
Energy: The speaker's level of vitality and active engagement throughout the interaction.
Rapport: The warmth and emotional quality of the interaction, reflecting how acknowledged and at ease the other person feels.
Learning: How openly the speaker reflects, tests ideas, and adapts, capturing their curiosity and growth orientation.
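To make the shape concrete, here is a minimal sketch of reading the score, its per-dimension breakdown, and the rolling timeline. The cqi, dimensions, and timeline field names are assumptions carried over from the request sketch above, not confirmed schema:

# Overall snapshot plus the five-dimension breakdown (field names assumed).
cqi = result["cqi"]
print(f"Overall CQI: {cqi['score']}/100")
for dimension, score in cqi["dimensions"].items():
    print(f"  {dimension}: {score}/100")

# The rolling timeline supports charting quality over the conversation.
for point in cqi["timeline"]:
    print(f"  t={point['time']:.0f}s  score={point['score']}")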
An omni-modal model purpose-built for understanding human social signals. It processes video, audio, and text together — in temporal alignment.
Inter-1 is trained specifically for social signal detection, leveraging a proprietary dataset that combines a behavioural-science-anchored ontology with expert-led labeling.
Evaluated against a wide variety of commercial and open-source frontier omni and vision LLMs, Inter-1 came out ahead on both accuracy and speed.
In blind A/B evaluations, behavioural science experts preferred Inter-1's rationales over competitor output 83% of the time (76% on evidential grounding, 91% on clarity).
Sentiment gives you positive/negative. We give you 12 specific signals with timestamps and rationale.
Stitching together face, voice, and body language models is months of work. This is one endpoint.
Every signal includes the observable cues that triggered it — which modalities, which behaviours. Auditable and actionable.
Agreement, confusion, frustration, hesitation, and 8 more — each with a probability level and a rationale explaining the observable cues that triggered it.
Video, audio, and text processed together in temporal alignment. Gaze direction, vocal prosody, posture, gestures, and speech — not just transcripts.
Continuous attention monitoring with three levels — engaged, neutral, or disengaged — indicating how oriented a person is to the interaction at every moment.
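As one illustration of consuming this track, the sketch below tallies how much of the conversation was spent at each level. The engagement_timeline field and its span shape are assumptions, not documented fields:

from collections import defaultdict

# Tally time spent at each engagement level (engaged / neutral / disengaged),
# reusing the result object from the request sketch above.
durations = defaultdict(float)
for span in result["engagement_timeline"]:
    durations[span["level"]] += span["end"] - span["start"]

total = sum(durations.values()) or 1.0
for level, seconds in sorted(durations.items()):
    print(f"{level}: {seconds:.0f}s ({100 * seconds / total:.0f}%)")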
A 0–100 score derived from social signals across five behavioural dimensions — Clarity, Authority, Energy, Rapport, and Learning — provided as both a snapshot and a timeline.
Every signal includes its rationale, with the modalities and behaviours involved. Auditable, actionable, and ready to forward to your LLM.
Upload a recording or connect live video. Token-based auth. One response shape for both. Any format ffmpeg supports.
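Under the same assumptions as the earlier sketches, the two ingestion paths might look like this; the base URL, endpoint path, and stream_url field are placeholders, and only the bearer-token pattern comes from the text above:

import os
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
headers = {"Authorization": f"Bearer {os.environ['INTER_API_KEY']}"}

# Path 1: upload a recording (any container/codec ffmpeg can decode).
with open("standup.webm", "rb") as f:
    recorded = requests.post(f"{BASE}/analyze", headers=headers,
                             files={"video": f}).json()

# Path 2: point the API at a live video source (the stream_url field is assumed).
live = requests.post(f"{BASE}/analyze", headers=headers,
                     json={"stream_url": "rtmp://example.com/live/room-1"}).json()

# One response shape for both: the same parsing code works unchanged.
for result in (recorded, live):
    print(result["cqi"]["score"], result["engagement"]["level"])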