For a voice conversation to feel natural, responses need to arrive in under 500 milliseconds. That threshold isn't arbitrary—it's the limit of human perception for delay in dialogue. Most "real-time" AI products miss it entirely, landing at 1 to 3 seconds. OpenAI's Advanced Voice Mode doesn't. Getting there required abandoning the standard toolkit for real-time audio: WebRTC.
WebRTC is the browser technology everyone uses for audio and video calls. It's proven, battle-tested, and deeply suboptimal for AI voice. The protocol assumes two humans talking—the algorithm that manages "who speaks next" is designed for human conversation patterns, not the precise, predictable behavior of a language model. When an AI starts talking, the standard WebRTC pipeline chokes: it misinterprets the AI's audio as echo, applies aggressive noise suppression that mangles the response, and makes call quality erratic. Most developers accept these constraints and patch around them. OpenAI took a different approach.
The company rebuilt its WebRTC stack from scratch. Not modified—rebuilt. This meant custom voice activity detection to know when a user has finished speaking, custom server-side audio processing to recognize its own generated audio, and custom jitter buffers that maintain sub-500ms latency without sacrificing quality. The result required infrastructure investment across 12 global regions, proprietary network traversal, and real-time protocol optimization that WebRTC's open-source roots never anticipated.
The payoff is a voice assistant that doesn't feel like one. Natural speech patterns work: you can interrupt, speak over responses, change your mind mid-sentence. The system handles all of it seamlessly because it was designed from the ground up for human conversation, not bolted onto a protocol built for human-to-human calls.
This is the choice other AI companies will have to make eventually. Patch around WebRTC's limitations and ship something that works—or invest in the engineering complexity that makes voice AI actually feel natural. OpenAI chose the latter. Whether users will notice the difference remains to be seen, but the technical foundation for genuinely conversational AI now exists.