OpenAI Launches GPT-Realtime-2 and Two New Voice API Models
May 8, 2026 - 9:55 am
GPT-Realtime-2 brings GPT-5-class reasoning to live voice. A separate translation model covers over 70 input languages. A streaming Whisper variant handles transcription. The pricing is competitive enough to be noteworthy.
OpenAI has released three new voice models in its API, expanding the options for developers integrating GPT-class reasoning into live audio. These include:
- GPT-Realtime-2, a successor to its existing real-time voice model, with GPT-5-class reasoning.
- GPT-Realtime-Translate, a live translation model supporting over 70 input and 13 output languages.
- GPT-Realtime-Whisper, a streaming speech-to-text model optimized for low-latency transcription.
This release arrives amid a surge in voice AI development, with enterprises rapidly building voice agents from a piecemeal stack: Whisper or Deepgram for transcription, ElevenLabs or Cartesia for text-to-speech, GPT-4 or Claude for reasoning, and custom logic for turn-taking and barge-in.
OpenAI's offering differs in that a single model handles both audio input and output, with reasoning integrated directly into the audio loop.
What's New?
GPT-Realtime-2 incorporates several capabilities previously achieved through prompt scaffolding:
- Preambles: Let the agent voice what it is doing, e.g. "let me check that," while a tool call runs, so users never hear dead silence.
- Parallel tool calls: Enable the model to fire multiple backend requests simultaneously and narrate progress while they run.
- Improved recovery behavior: Handles failures gracefully, rather than freezing conversations.
- Tone control: Agents can adjust their tone based on context, like using a calmer tone for support cases or a more upbeat tone for confirmations.
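As a rough illustration of how these behaviors might be wired up, the sketch below builds a session-configuration event in the shape of the existing Realtime API's `session.update` message. The model identifier, instruction wording, and tool definition are assumptions for illustration, not documented values.

```python
import json

# Hypothetical session configuration for a support agent. The event type
# ("session.update") mirrors OpenAI's existing Realtime API; the model
# name, instructions, and tool are illustrative assumptions.
session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",  # assumed model identifier
        "voice": "alloy",
        "instructions": (
            "Before calling a tool, say a short preamble such as "
            "'let me check that' so the caller never hears dead air. "
            "Use a calm, measured tone for support issues and a brighter "
            "tone when confirming a completed action."
        ),
        "tools": [
            {
                "type": "function",
                "name": "lookup_order",  # hypothetical backend function
                "description": "Fetch order status by order ID.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            }
        ],
    },
}

# The event would be serialized and sent over the Realtime WebSocket.
payload = json.dumps(session_update)
```

In practice the preamble and tone behaviors are model capabilities rather than prompt tricks, but the session instructions remain the natural place to steer when and how they fire.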
Two key technical advancements stand out:
- Increased context window: Now 128K, up from 32K, enabling longer sessions and more complex agentic flows without external state stitching.
- Reasoning effort control: Developers can set minimal, low, medium, high, or xhigh effort; low is the default, keeping latency tight.
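One plausible way to use the effort knob is to stay on the low default and escalate only for turns that need real planning. The five effort names come from the release; the parameter name and the escalation policy below are assumptions, not OpenAI guidance.

```python
# Effort levels named in the release, ordered lowest to highest.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def pick_effort(turn_requires_planning: bool, latency_budget_ms: int) -> str:
    """Choose a reasoning-effort setting for the next turn.

    'low' is the stated default for tight latency; this escalation
    policy is an illustrative assumption.
    """
    if not turn_requires_planning:
        return "low"  # default: keeps response latency tight
    # With a generous latency budget, spend more effort on hard turns.
    return "high" if latency_budget_ms >= 2000 else "medium"

# A per-session override might look like this (field name assumed):
session_patch = {
    "type": "session.update",
    "session": {"reasoning_effort": pick_effort(True, 3000)},
}
```

The point of the sketch is the trade-off: effort buys benchmark accuracy (see below) at the cost of time-to-first-audio, so a voice agent wants the decision made per turn, not per deployment.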
On OpenAI's benchmarks:
- GPT-Realtime-2 at high effort outperforms GPT-Realtime-1.5 by 15.2% on Big Bench Audio and by 13.8% on Audio MultiChallenge for instruction following.
Customer benchmarks demonstrate even sharper improvements, with Zillow reporting a 26-point lift in call success rate from 69% to 95%. BolnaAI, a voice AI company focusing on Indian languages, reports 12.5% lower word error rates using the translation model for Hindi, Tamil, and Telugu.
GPT-Realtime-2 is priced at $32 per million audio input tokens, $0.40 per million cached input tokens, and $64 per million audio output tokens. GPT-Realtime-Translate is priced at $0.034 per token.
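At those rates, per-session cost is easy to sketch. The helper below uses the GPT-Realtime-2 prices quoted above, treating all three as per-million-token rates (the cached-input unit is an assumption).

```python
# GPT-Realtime-2 prices quoted in the release, in dollars per million
# audio tokens. (The cached-input rate is assumed to be per million too.)
INPUT_PER_M = 32.00
CACHED_INPUT_PER_M = 0.40
OUTPUT_PER_M = 64.00

def session_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimate the audio-token cost of one session, in dollars."""
    return (input_tokens * INPUT_PER_M
            + cached_tokens * CACHED_INPUT_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000

# e.g. a session with 200K fresh input, 300K cached input, 100K output:
cost = session_cost(200_000, 300_000, 100_000)  # → $12.92
```

The 80x discount on cached input matters for long agentic sessions: with the 128K context window, most of each turn's input is a replay of earlier audio, so the effective per-turn cost is dominated by fresh input and output tokens.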