
Solving the 500ms Latency Barrier in AI Voice

How we optimized our voice stack to achieve sub-second response times for natural conversation.

Humans perceive a pause longer than about 500ms as "slow" in conversation. A standard sequential pipeline (Speech-to-Text -> LLM -> Text-to-Speech), where each stage waits for the previous one to finish completely, often takes 2-3 seconds end to end. That kills the vibe.

The Stack

To fix this, we moved away from request-response HTTP and built a fully streaming WebSocket pipeline, where every stage consumes and emits data incrementally.

  1. VAD (Voice Activity Detection): Runs locally on the edge to instantly detect when the user stops speaking (sketch below).
  2. Streaming STT: Deepgram Nova-2 for ultra-fast transcription over a live WebSocket (sketch below).
  3. Groq Inference: LPU inference engines start emitting tokens in under 100ms via a streaming endpoint (sketch below).
  4. Streaming TTS: ElevenLabs Turbo v2.5 for near-instant audio generation, fed text as it is produced (sketch below).
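
The post doesn't name a specific VAD, so here is a minimal browser-side sketch using a simple energy threshold over Web Audio samples. Production systems typically use a trained model (e.g. Silero), and the thresholds below are illustrative, but the endpointing logic (a hangover timer after the signal drops below threshold) looks the same:

```typescript
// Minimal energy-threshold VAD sketch (browser). The real stack likely
// uses a model-based VAD; these constants are illustrative assumptions.
const SPEECH_THRESHOLD = 0.015; // RMS level treated as "speech"
const HANGOVER_MS = 300;        // silence required before "stopped speaking"

async function runVad(onSpeechStart: () => void, onSpeechEnd: () => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext();
  const analyser = ctx.createAnalyser();
  analyser.fftSize = 1024;
  ctx.createMediaStreamSource(stream).connect(analyser);

  const buf = new Float32Array(analyser.fftSize);
  let speaking = false;
  let silenceStart = 0;

  const tick = () => {
    analyser.getFloatTimeDomainData(buf);
    // Root-mean-square energy of the current audio frame.
    const rms = Math.sqrt(buf.reduce((s, x) => s + x * x, 0) / buf.length);

    if (rms > SPEECH_THRESHOLD) {
      if (!speaking) { speaking = true; onSpeechStart(); }
      silenceStart = 0;
    } else if (speaking) {
      silenceStart ||= performance.now();
      // Only declare end-of-speech after sustained silence, so short
      // pauses between words don't cut the user off mid-sentence.
      if (performance.now() - silenceStart > HANGOVER_MS) {
        speaking = false;
        onSpeechEnd();
      }
    }
    requestAnimationFrame(tick);
  };
  tick();
}
```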
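Streaming STT over Deepgram's live WebSocket endpoint looks roughly like this (Node, using the `ws` package). The endpoint and message shape follow Deepgram's documented streaming API, but treat the specific options (e.g. `endpointing=300`) as a sketch, not our exact production config:

```typescript
import WebSocket from "ws";

// Sketch of a Deepgram live-transcription connection.
const dg = new WebSocket(
  "wss://api.deepgram.com/v1/listen?model=nova-2&interim_results=true&endpointing=300",
  { headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` } }
);

dg.on("open", () => {
  // Forward raw audio chunks (e.g. PCM frames relayed from the client
  // WebSocket) as binary messages the moment they arrive:
  // dg.send(pcmChunk);
});

dg.on("message", (data) => {
  const msg = JSON.parse(data.toString());
  const transcript = msg.channel?.alternatives?.[0]?.transcript;
  if (transcript && msg.is_final) {
    // Hand finalized text to the LLM stage immediately.
    console.log("final:", transcript);
  }
});
```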
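Groq exposes an OpenAI-compatible chat completions endpoint, so streaming tokens is a standard server-sent-events read. A sketch for Node 18+; the model name is an assumption for illustration:

```typescript
// Stream tokens from Groq's OpenAI-compatible endpoint and forward each
// one downstream without waiting for the full reply.
async function streamReply(prompt: string, onToken: (t: string) => void) {
  const res = await fetch("https://api.groq.com/openai/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama-3.1-8b-instant", // assumed model name, for illustration
      messages: [{ role: "user", content: prompt }],
      stream: true, // server-sent events, one JSON chunk per token batch
    }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // SSE frames are newline-delimited "data: {...}" lines.
    const lines = buffer.split("\n");
    buffer = lines.pop()!; // keep any partial line for the next read
    for (const line of lines) {
      const payload = line.replace(/^data: /, "").trim();
      if (!payload || payload === "[DONE]") continue;
      const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
      if (delta) onToken(delta); // feed TTS as tokens arrive
    }
  }
}
```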
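On the TTS side, ElevenLabs' WebSocket `stream-input` API accepts text incrementally and returns base64-encoded audio chunks. This sketch follows their documented message format, with the voice ID left as a placeholder:

```typescript
import WebSocket from "ws";

// Sketch of ElevenLabs streaming TTS input. VOICE_ID is a placeholder.
const VOICE_ID = "your-voice-id";
const tts = new WebSocket(
  `wss://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}/stream-input?model_id=eleven_turbo_v2_5`
);

tts.on("open", () => {
  // The first message opens the stream and authenticates.
  tts.send(JSON.stringify({
    text: " ",
    xi_api_key: process.env.ELEVENLABS_API_KEY,
  }));
});

// Feed LLM tokens in as they arrive; send { text: "" } when the reply ends.
export function speak(tokenText: string) {
  tts.send(JSON.stringify({ text: tokenText }));
}

tts.on("message", (data) => {
  const msg = JSON.parse(data.toString());
  if (msg.audio) {
    // Base64-encoded audio chunk; decode and push to the client right away.
    const chunk = Buffer.from(msg.audio, "base64");
    // playbackQueue.push(chunk);
  }
});
```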

The Result

We achieved a median round-trip latency of 700ms, which feels near-instant. The user can interrupt the AI mid-sentence, and it stops speaking immediately (barge-in sketch below). This is the difference between a "robot" and a "digital employee".
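
Interruption handling is mostly plumbing: when the VAD fires a speech-start event while the assistant is talking, cancel the in-flight LLM stream, tear down the TTS stream, and flush any queued audio. A hedged sketch, where `tts`, `llmAbort`, and `playbackQueue` are hypothetical handles to the stages sketched above:

```typescript
declare const tts: { close(): void }; // TTS WebSocket from the sketch above

let llmAbort = new AbortController(); // pass llmAbort.signal to the LLM fetch
let playbackQueue: Buffer[] = [];     // audio chunks waiting to be played
let assistantSpeaking = false;

function onUserSpeechStart() {
  if (!assistantSpeaking) return;
  llmAbort.abort();          // stop generating tokens we'll never speak
  tts.close();               // tear down the in-flight TTS stream
  playbackQueue.length = 0;  // drop queued audio so playback stops now
  assistantSpeaking = false;
  llmAbort = new AbortController(); // fresh controller for the next turn
}
```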

Ready to implement this?

Book a call with our engineering team to discuss your specific use case.