OpenAI’s Advanced Voice Mode Rolls Out to ChatGPT Plus Users

The Rollout Begins

On July 30, 2024, OpenAI began rolling out Advanced Voice Mode (AVM) to a small subset of ChatGPT Plus subscribers. The feature, first demoed at the GPT-4o launch event in May, replaces the earlier voice mode, which chained three separate models (speech-to-text, a language model, and text-to-speech), with a single multimodal model. Because AVM processes pitch, rhythm, and tone directly, it can laugh, whisper, or express excitement without passing through a text intermediate. A broader launch is scheduled for fall 2024.

The Technical Leap Behind Advanced Voice Mode

Unlike the previous voice mode, which averaged roughly 2.8 seconds per round trip with GPT-3.5 and 5.4 seconds with GPT-4, AVM responds in about 320 milliseconds on average, comparable to human conversational turn-taking. It does so by feeding raw audio into GPT-4o’s multimodal attention layers, bypassing the transcription bottleneck. The model also handles interruptions naturally: if a user says “Wait, let me reconsider,” the AI stops mid-sentence and listens. Supporting this required retraining the model’s decay parameters to avoid truncating user speech.

Another technical detail is the integration of a non-verbal event detector. When a user coughs, sighs, or laughs, the model can decide whether to acknowledge it or continue the flow, depending on context. In internal benchmarks, AVM correctly identified emotional cues like frustration or hesitation 87% of the time, versus 52% for the previous text-based pipeline. However, the model still relies on a separate voice activity detection module to determine when a user has finished speaking, which can introduce occasional false positives in noisy environments.
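To see why a separate voice activity detector can misfire in a noisy room, consider a minimal sketch of an energy-threshold detector with a short "hangover" window. This is an illustration only, not OpenAI's module: the function names, the 0.1 threshold, and the three-frame hangover are all assumptions chosen for the example.

```python
import math
from typing import List, Optional


def frame_rms(frame: List[float]) -> float:
    """Root-mean-square amplitude of one non-empty frame of audio samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def end_of_turn_index(
    frames: List[List[float]],
    threshold: float = 0.1,    # hypothetical energy threshold
    hangover_frames: int = 3,  # hypothetical: quiet frames required to end a turn
) -> Optional[int]:
    """Return the index of the frame where the turn is judged finished, or None."""
    speech_started = False
    quiet_run = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            # Loud frame: treated as speech, which resets the quiet counter.
            speech_started = True
            quiet_run = 0
        elif speech_started:
            quiet_run += 1
            if quiet_run >= hangover_frames:
                return i
    return None


# Synthetic constant-amplitude frames stand in for real audio.
speech = [0.5] * 160
silence = [0.0] * 160
noise = [0.2] * 160  # background noise louder than the threshold

quiet_room = [speech] * 4 + [silence] * 5
noisy_room = [speech] * 4 + [noise] * 5

print(end_of_turn_index(quiet_room))  # 6
print(end_of_turn_index(noisy_room))  # None
```

In the quiet room the detector ends the turn three frames after the speaker stops. In the noisy room every background frame is misclassified as speech, so the turn never ends. Production detectors are far more sophisticated, but the failure mode has the same shape.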

Rollout and Availability

Advanced Voice Mode is initially available only to ChatGPT Plus subscribers in the United States, who pay $20 per month. OpenAI plans to expand to the Team and Enterprise tiers in the fourth quarter of 2024, with an education-tier rollout following in early 2025. Free-tier users will not receive Advanced Voice Mode at all, because the company’s margins on audio inference are significantly lower than on text. OpenAI estimates that processing one minute of interactive voice conversation costs roughly eight times as much as generating 4,000 tokens of text.

To manage server load, the company has throttled usage to a “limited daily allowance” of approximately 30 minutes of active voice conversation per user per day. This cap may change as inference hardware efficiency improves. At launch, AVM offers four preset voices: Breeze, Cove, Ember, and Juniper. Sky, which OpenAI paused in May 2024, is not among them. Each voice was trained on a distinct actor’s audio under a licensing agreement.
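A per-user daily allowance of this kind can be sketched in a few lines. The example below assumes the roughly 30-minute figure reported above and a simple in-memory store. The class and method names are hypothetical, and nothing here reflects how OpenAI actually meters usage.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Tuple

DAILY_CAP_SECONDS = 30 * 60  # assumption: the approximate 30-minute cap


@dataclass
class VoiceAllowance:
    """Hypothetical tracker of active voice seconds per user per calendar day."""

    cap_seconds: int = DAILY_CAP_SECONDS
    _used: Dict[Tuple[str, date], int] = field(default_factory=dict)

    def remaining(self, user_id: str, day: date) -> int:
        """Seconds of voice time the user has left on the given day."""
        return self.cap_seconds - self._used.get((user_id, day), 0)

    def consume(self, user_id: str, day: date, seconds: int) -> int:
        """Grant up to `seconds` of voice time and return the amount granted."""
        if seconds < 0:
            raise ValueError("seconds must be non-negative")
        granted = min(seconds, self.remaining(user_id, day))
        key = (user_id, day)
        self._used[key] = self._used.get(key, 0) + granted
        return granted


allowance = VoiceAllowance()
today = date(2024, 7, 30)
print(allowance.consume("user-1", today, 25 * 60))  # 1500
print(allowance.consume("user-1", today, 10 * 60))  # 300 (only 5 minutes were left)
print(allowance.remaining("user-1", today))         # 0
```

Keying usage by user and date means the allowance resets implicitly at the start of each day, with no scheduled reset job needed.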

How It Compares to Prior Voice Features

The former voice mode, launched in September 2023, used Whisper for speech-to-text, GPT-4 (or GPT-3.5) for response generation, and an in-house text-to-speech model based on TorToiSe. That pipeline struggled with rapid back-and-forth exchanges: the conversational flow was clunky because the entire transcript had to be re-sent to the language model after each voice round trip. AVM avoids this by streaming audio directly into GPT-4o’s autoregressive decoder, which lets the model maintain a coherent thread across multi-turn voice interactions.
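The cost of re-sending the transcript is easy to see in a toy version of the cascaded loop. In the sketch below, all three stage functions are hypothetical stand-ins written for this example. They are not the Whisper, GPT-4, or text-to-speech APIs. The point is only how the payload grows with each turn.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (speaker, text)


def transcribe(audio: str) -> str:
    """Hypothetical stand-in for a speech-to-text call."""
    return audio.strip().lower()


def generate_reply(transcript: List[Message]) -> str:
    """Hypothetical stand-in for a language-model call that receives the full transcript."""
    user_turns = sum(1 for speaker, _ in transcript if speaker == "user")
    return f"reply {user_turns}"


def synthesize(text: str) -> str:
    """Hypothetical stand-in for a text-to-speech call."""
    return f"<audio:{text}>"


def run_cascaded(utterances: List[str]) -> List[int]:
    """Run the three-stage loop and return how many messages were re-sent on each turn."""
    transcript: List[Message] = []
    resent_per_turn: List[int] = []
    for audio in utterances:
        transcript.append(("user", transcribe(audio)))
        # The whole transcript, not just the new utterance, goes to the language model.
        resent_per_turn.append(len(transcript))
        reply = generate_reply(transcript)
        transcript.append(("assistant", reply))
        synthesize(reply)
    return resent_per_turn


print(run_cascaded(["Hello", "Wait", "Why?", "Go on"]))  # [1, 3, 5, 7]
```

Each turn re-sends two more messages than the one before, so the total work grows quadratically with conversation length, on top of the latency added by running three stages in sequence.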

Apple’s Siri and Amazon’s Alexa rely on cascaded architectures similar to the former voice mode (voice-to-text, NLU, text-to-speech) and have latencies closer to 800 ms to 1.5 seconds per turn. Google’s Gemini Live, announced in May 2024, also promises a multimodal voice mode, but as of this writing it is still in limited beta and does not support real-time interruption handling. OpenAI claims AVM is the first commercially deployed voice AI that can simulate emotional range without explicit scripted intents.

Safety and Guardrails

OpenAI has implemented several safety measures specific to the Advanced Voice Mode. The system uses a separate “voice mimicry classifier” that detects and blocks any attempt to impersonate a specific person—for example, producing a voice that matches a user’s own timbre for phishing. The model is also prohibited from generating “sensitive” sounds like sirens, crying babies, or sexual noises. During internal red-teaming, the classifier stopped 92% of impersonation attempts, but three edge cases during early testing allowed the model to mimic a user after seven uninterrupted seconds of audio input.

Additionally, OpenAI added a watermark to all generated audio outputs, embedding a unique digital signature that can later be traced to a specific user session. This watermark is imperceptible to humans but can be read by the company’s forensic tool. The company has also restricted the feature’s use in emergency contexts: if a user says “I’m having a heart attack,” the model is trained to respond “I’m not a medical professional; please call 911” rather than providing instructions.

Potential Use Cases and Implications

Early testers have used AVM for language tutoring, where it corrects pronunciation and rhythm in real time, and for therapeutic-style reflection, where the model adjusts its tone to match the user’s emotional state. Some developers are exploring AVM as a replacement for interactive voice response systems in customer support, but OpenAI’s current API terms prohibit reselling voice mode as a standalone product. The feature also raises privacy questions: all audio clips are temporarily stored on OpenAI’s servers for model improvement unless the user opts out in settings. The company’s privacy policy notes that audio recordings may be reviewed by human annotators, but only after personally identifiable information has been removed.

With AVM, conversational AI has crossed a threshold where the medium itself—tone, timing, emotion—becomes part of the transmitted information rather than a side effect. Whether that leads to deeper user engagement or new forms of manipulation depends on how quickly guardrails evolve alongside the technology.