Solving the 500ms Latency Barrier in AI Voice
How we optimized our voice stack to achieve sub-second response times for natural conversation.
Humans perceive a conversational pause longer than about 500ms as "slow." A standard request/response LLM pipeline (Speech-to-Text -> LLM -> Text-to-Speech) often takes 2-3 seconds, because each stage waits for the previous one to finish completely. That kills the vibe.
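To see where those seconds go, here is a rough budget comparing a fully serial pipeline with a streaming one. Every number below is an assumption chosen for illustration, not a measurement from our stack:

```python
# Illustrative latency budgets; all numbers are assumptions, not measurements.

# Serial pipeline: each stage waits for the previous one to finish completely.
serial_ms = {
    "stt_full_utterance": 600,   # transcribe the entire utterance
    "llm_full_response": 1200,   # generate the entire reply
    "tts_full_audio": 500,       # synthesize the entire audio clip
}
print(sum(serial_ms.values()))   # 2300 -> the 2-3s range users read as "slow"

# Streaming pipeline: stages overlap, so the user hears audio once the first
# chunk is ready rather than after every stage has fully finished.
streaming_ms = {
    "vad_endpoint": 100,         # detect end of speech at the edge
    "stt_final_partial": 200,    # last incremental transcript lands
    "llm_first_token": 100,      # time-to-first-token
    "tts_first_chunk": 200,      # first audio chunk synthesized
}
print(sum(streaming_ms.values()))  # 600 -> sub-second territory
```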
The Stack
To fix this, we moved away from request/response HTTP and built a pure WebSocket pipeline in which every stage streams into the next; a sketch of the wiring follows the list.
- VAD (Voice Activity Detection): Running locally on the edge to detect the instant the user stops speaking.
- Streaming STT: Deepgram Nova-2 for ultra-fast transcription.
- Groq Inference: Using LPU inference engines to get time-to-first-token under 100ms.
- Streaming TTS: ElevenLabs Turbo v2.5 for instant audio generation.
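Concretely, here is a minimal sketch of how those stages can be chained so that data flows stage-to-stage as it arrives. This is an illustration under assumptions, not our production code: the endpoint URLs, the `play` sink, and the message shapes are all hypothetical stand-ins, and the real vendor streaming APIs differ.

```python
import asyncio
import websockets  # assumed dependency; any async WebSocket client works

# Hypothetical endpoints, stand-ins for the real vendor streaming APIs.
STT_WS = "wss://stt.example.com/stream"
LLM_WS = "wss://llm.example.com/stream"
TTS_WS = "wss://tts.example.com/stream"

def play(audio_chunk):
    """Hypothetical audio sink: hand the chunk to the playback device."""
    ...

async def pipeline(mic_frames):
    """Chain STT -> LLM -> TTS as concurrent streaming stages.

    Each stage forwards data the moment it arrives instead of waiting for
    the previous stage to finish, so perceived latency tracks the
    time-to-first-audio-chunk rather than the sum of stage totals.
    """
    async with websockets.connect(STT_WS) as stt, \
               websockets.connect(LLM_WS) as llm, \
               websockets.connect(TTS_WS) as tts:

        async def feed_stt():
            async for frame in mic_frames:   # raw PCM frames from the mic
                await stt.send(frame)

        async def stt_to_llm():
            # In practice you gate this on a VAD end-of-speech event;
            # transcripts are forwarded directly here to keep the sketch short.
            async for transcript in stt:     # incremental transcripts
                await llm.send(transcript)

        async def llm_to_tts():
            async for token in llm:          # tokens as they stream out
                await tts.send(token)

        async def play_audio():
            async for audio_chunk in tts:    # audio chunks, played immediately
                play(audio_chunk)

        await asyncio.gather(feed_stt(), stt_to_llm(),
                             llm_to_tts(), play_audio())
```

The design point is that nothing in the chain blocks on a complete response; the WebSocket connections stay open for the whole conversation, which is what the move away from per-request HTTP buys.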
The Result
We achieved a median round-trip latency of 700ms, which feels near-instant in conversation. The user can interrupt the AI mid-sentence (barge-in), and it stops speaking immediately. This is the difference between a "robot" and a "digital employee".
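Barge-in boils down to cancelling the in-flight reply the moment VAD detects the user speaking over the agent. Here is a minimal asyncio sketch under assumptions: `vad_events`, `make_reply`, and the event strings are hypothetical names for illustration, not our production interfaces.

```python
import asyncio

async def barge_in_loop(vad_events, make_reply):
    """Cancel the in-flight reply as soon as the user starts speaking.

    vad_events: async iterator yielding "speech_start" / "speech_end"
    make_reply: coroutine factory that streams LLM tokens into TTS
    (both hypothetical names for this sketch).
    """
    reply_task = None
    async for event in vad_events:
        if event == "speech_start":
            if reply_task and not reply_task.done():
                reply_task.cancel()       # halts TTS playback and LLM generation
                try:
                    await reply_task      # let cleanup (flushing buffers) finish
                except asyncio.CancelledError:
                    pass
        elif event == "speech_end":
            # User finished a turn: kick off a fresh streaming reply.
            reply_task = asyncio.create_task(make_reply())
```

Because the whole downstream chain hangs off one task, a single `cancel()` propagates through every stage, which is what makes the stop feel immediate.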