OpenAI launches GPT-Realtime-2 and two new voice API models

May 8, 2026 - 9:55 am

GPT-Realtime-2 brings GPT-5-class reasoning to live voice. A separate translation model covers over 70 input languages. A streaming Whisper variant handles transcription. The pricing is competitive enough to be noteworthy.

OpenAI has released three new voice models in its API, expanding the options for developers building GPT-class reasoning into live audio:

  • GPT-Realtime-2, a successor to its existing real-time voice model with GPT-5-class reasoning.
  • GPT-Realtime-Translate, a live translation model supporting over 70 input and 13 output languages.
  • GPT-Realtime-Whisper, a streaming speech-to-text model optimized for low-latency transcription.

This release arrives amid a surge in voice AI development. Enterprises have been building voice agents piecemeal: Whisper or Deepgram for transcription, ElevenLabs or Cartesia for text-to-speech, GPT-4 or Claude for reasoning, and custom logic for turn-taking and barge-in.

OpenAI's offering differs in that a single model handles both audio input and output, with reasoning integrated directly into the audio loop.

What's New?

GPT-Realtime-2 builds in several behaviors that previously required prompt scaffolding:

  • Preambles: The agent can announce what it is doing ("let me check that") while a tool call runs, so users never hear dead air.
  • Parallel tool calls: The model can fire multiple backend requests at once and narrate progress while they run.
  • Improved recovery behavior: Failed tool calls are handled gracefully instead of freezing the conversation.
  • Tone control: Agents can adapt tone to context, e.g. calmer for support cases, more upbeat for confirmations.
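
A minimal sketch of how these behaviors might be wired up, assuming the existing Realtime API's `session.update` event and function-tool schema carry over to GPT-Realtime-2. The model name comes from this announcement; the `parallel_tool_calls` flag, the `lookup_order` tool, and the instruction-driven preamble are illustrative assumptions, not published API details:

```python
# Sketch: a session.update payload that asks the agent to speak a short
# preamble before tool calls and (hypothetically) enables parallel calls.
import json

def build_session_update():
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",  # name from the announcement
            "instructions": (
                "Before calling a tool, briefly say what you are doing, "
                "e.g. 'let me check that.'"
            ),
            "parallel_tool_calls": True,  # assumed flag
            "tools": [
                {
                    "type": "function",
                    "name": "lookup_order",  # hypothetical tool
                    "description": "Fetch an order's status by ID.",
                    "parameters": {
                        "type": "object",
                        "properties": {"order_id": {"type": "string"}},
                        "required": ["order_id"],
                    },
                }
            ],
        },
    }

# The payload would be serialized and sent over the Realtime WebSocket.
event_json = json.dumps(build_session_update())
```

In practice the preamble is a model behavior rather than a config switch, so the instruction text above is doing the work; the announcement suggests GPT-Realtime-2 produces preambles natively.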

Two key technical advancements stand out:

  • Increased context window: Now 128K, up from 32K, enabling longer sessions and more complex agentic flows without external state stitching.
  • Reasoning effort control: Developers can dial effort to minimal, low, medium, high, or xhigh; low is the default, keeping latency tight.
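
The effort knob might look something like this in practice. The `"reasoning": {"effort": ...}` shape mirrors OpenAI's text APIs; whether the Realtime API accepts it at the session level, or as a per-response override as sketched in `response_with_effort`, is an assumption, as is the new "xhigh" level from this announcement:

```python
# Sketch: start a session at low effort for tight latency, then request
# higher effort for a single complex turn.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def session_config(effort="low"):
    """Session-level default; low keeps latency tight per the article."""
    assert effort in EFFORT_LEVELS
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",
            "reasoning": {"effort": effort},  # assumed field shape
        },
    }

def response_with_effort(effort):
    """Hypothetical per-turn override for a hard question."""
    assert effort in EFFORT_LEVELS
    return {
        "type": "response.create",
        "response": {"reasoning": {"effort": effort}},
    }

default_session = session_config()           # low by default
hard_turn = response_with_effort("high")     # escalate for one turn
```

The design question for developers is where to escalate: a session-wide bump costs latency on every turn, while a per-turn override keeps the common path fast.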

On OpenAI's own benchmarks, GPT-Realtime-2 at high effort outperforms GPT-Realtime-1.5 by 15.2% on Big Bench Audio and by 13.8% on Audio MultiChallenge (instruction following).

Customer benchmarks demonstrate even sharper improvements, with Zillow reporting a 26-point lift in call success rate from 69% to 95%. BolnaAI, a voice AI company focusing on Indian languages, reports 12.5% lower word error rates using the translation model for Hindi, Tamil, and Telugu.

GPT-Realtime-2 is priced at $32 per million audio input tokens, $0.40 per million cached input tokens, and $64 per million audio output tokens. GPT-Realtime-Translate is priced at $0.034 per token.
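
To put those GPT-Realtime-2 rates in concrete terms, here is a back-of-the-envelope cost calculation using the per-million-token prices quoted above; the token counts in the example are illustrative, not measured:

```python
# Cost estimate for one GPT-Realtime-2 session at the quoted prices.
INPUT_PER_M = 32.00    # USD per million audio input tokens
CACHED_PER_M = 0.40    # USD per million cached input tokens
OUTPUT_PER_M = 64.00   # USD per million audio output tokens

def session_cost(input_tokens, cached_tokens, output_tokens):
    """Return the dollar cost of a session given its token counts."""
    return (input_tokens * INPUT_PER_M
            + cached_tokens * CACHED_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000

# Example: 50k fresh input, 200k cached input, 30k output tokens
print(f"${session_cost(50_000, 200_000, 30_000):.2f}")  # → $3.60
```

Note how cheaply cached input reprices repeated context: the 200k cached tokens above cost eight cents, versus $6.40 had they been billed as fresh input.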