Why the Next Leap in AI Video is Teaching Avatars to See and Listen
July 2, 2026 – 1:39 pm
Image by: Canva
TL;DR
AI video is shifting from a fidelity race to an interactivity race. A new class of interactive avatar models can be graded on three levels: Level 1 (talk), Level 2 (talk and listen), and Level 3 (talk, listen, and see). The breakthrough occurs when an avatar learns to listen and react in real time, transforming a talking face into a convincing conversational partner.
For years, progress in generative video and AI avatars has focused on sharper detail, better physics, and smoother motion. However, the evolution of online media formats is shifting the focus toward interactivity. Software is increasingly mediated by agents, and hybrid architectures that blend autoregressive and diffusion methods are gaining traction. Together, these shifts are paving the way for new applications such as open-world simulations and live dialogue.
Interactivity, not resolution, is becoming the frontier.
Consequently, a new category of video models is emerging, designed to produce talking avatars that react to humans in real time, with latencies low enough for natural conversation. Borrowing from the six levels of automation used to describe self-driving cars, these Interactive Avatar Models can be grouped into three levels:
- Level 1: Can talk but has no awareness of the person in front of it. Most current talking avatar systems reach this level.
- Level 2: Can talk and listen, reacting while the other person speaks with small visual cues like nods or shifts in expression, and vocal cues such as "mhm." This is a harder problem than Level 1 because the model must interpret incoming signals and respond continuously in real time.
- Level 3: Can talk, listen, and see, reacting to posture, gesture, and facial expressions the way a person would on a video call.
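For readers who think in code, the three levels above can be sketched as a small capability check. This is only an illustration, not part of any real avatar system: the names `AvatarLevel`, `Capabilities`, and `classify` are hypothetical, and the sketch assumes the levels are cumulative, so an avatar that can see but not listen still counts as Level 1.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class AvatarLevel(IntEnum):
    """Hypothetical encoding of the three levels described above."""
    TALK = 1             # Level 1: talks, unaware of the other person
    TALK_LISTEN = 2      # Level 2: talks and reacts to incoming audio
    TALK_LISTEN_SEE = 3  # Level 3: also reacts to posture, gesture, expression


@dataclass(frozen=True)
class Capabilities:
    """Illustrative capability flags for an avatar model."""
    talks: bool
    listens: bool = False
    sees: bool = False


def classify(caps: Capabilities) -> Optional[AvatarLevel]:
    """Return the highest level whose requirements are all met.

    Assumption: levels are cumulative, so each level requires every
    capability of the level below it. Returns None if the avatar
    cannot talk at all.
    """
    if not caps.talks:
        return None
    if caps.listens and caps.sees:
        return AvatarLevel.TALK_LISTEN_SEE
    if caps.listens:
        return AvatarLevel.TALK_LISTEN
    return AvatarLevel.TALK
```

For example, `classify(Capabilities(talks=True, listens=True))` returns `AvatarLevel.TALK_LISTEN`, which matches the Level 2 description in the list.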
The goal of moving beyond Level 1 models is to create avatars that feel alive and responsive, rather than just mechanically talking without regard for their conversation partner.