Voice AI Latency: What to Actually Measure
Vendor latency numbers hide the real UX problem. This post breaks down voice-to-voice latency, tool-call overhead, and the engineering levers that actually matter.
Vendor latency numbers are the wrong metric
Every voice AI vendor will happily show you a latency number that sounds sharp on a slide. Usually it is some version of:
"Average response time: 400ms"
The problem is that this almost never describes the experience the caller actually feels.
That number is usually measuring only the LLM's time to first token. It does not include:
- turn detection
- speech-to-text time
- tool-call overhead
- text-to-speech startup
- transport overhead between components
If the caller experiences 1.2 to 1.8 seconds of dead air, the internal 400ms number is not the truth that matters.
The metric that maps to user experience
The metric worth caring about is simple:
voice-to-voice latency - the time between when the caller stops speaking and when they hear the agent begin speaking back.
That is the number users experience. Everything else is internal decomposition.
In practice, the parts stack up fast:
- Turn detection: ~150ms
- STT: 350-370ms
- LLM: 450-700ms
- TTS: 80-130ms
- Transport overhead: ~50ms

Total: ~1.2-1.8s before tool calls
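If you want to know your real number, measure it at the edges rather than summing vendor-reported internals. Here is a minimal sketch of timing voice-to-voice latency directly, assuming your pipeline exposes hooks for "caller stopped speaking" and "first agent audio played" — the hook names here are hypothetical:

```python
import time

class VoiceToVoiceTimer:
    """Measures the gap between end of caller speech and first agent audio."""

    def __init__(self):
        self._speech_ended_at = None
        self.samples = []  # voice-to-voice latencies, in seconds

    def on_caller_speech_end(self):
        # Fired by turn detection when the caller stops speaking.
        self._speech_ended_at = time.monotonic()

    def on_agent_audio_start(self):
        # Fired when the first TTS audio frame is played back to the caller.
        if self._speech_ended_at is not None:
            self.samples.append(time.monotonic() - self._speech_ended_at)
            self._speech_ended_at = None

    def p95_ms(self):
        # Track a tail percentile, not just the average -- callers
        # remember the slow turns.
        ordered = sorted(self.samples)
        return ordered[int(0.95 * (len(ordered) - 1))] * 1000
```

Wire the two callbacks into whatever events your stack emits, and report the p95, not the mean.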
Once tool calls enter the loop, the budget gets worse very quickly.
Tool calls are where useful agents get slow
The moment a voice agent starts doing work that matters - checking an order, looking up inventory, opening a ticket, taking a payment - latency stops being a model-only problem.
A single useful turn can involve:
- the model deciding to call a tool
- the tool round-trip
- the model processing the tool result
- text-to-speech starting again
That means a tool-heavy turn often adds another 600-1,200ms on top of the base response time.
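The four steps above can be sketched end to end. Everything below is a stub — the delays are illustrative stand-ins, not measurements, and `lookup_order` is a hypothetical tool — but the structure shows where the extra time accrues:

```python
import asyncio
import time

async def lookup_order(order_id: str) -> dict:
    # Stand-in for a real API round-trip to an order system.
    await asyncio.sleep(0.3)
    return {"order_id": order_id, "status": "shipped"}

async def tool_heavy_turn() -> float:
    """One tool-using turn; returns total elapsed milliseconds."""
    t0 = time.monotonic()
    await asyncio.sleep(0.5)                  # 1. model decides to call a tool
    result = await lookup_order("A123")       # 2. tool round-trip
    await asyncio.sleep(0.5)                  # 3. model processes the result
    await asyncio.sleep(0.1)                  # 4. TTS starts again
    return (time.monotonic() - t0) * 1000

# With these stub delays the turn lands around 1.4 seconds --
# and the caller hears silence for almost all of it.
```

The tool round-trip itself is often the smallest piece; the two model passes that bracket it are what make tool-heavy turns expensive.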
The most common engineering mistakes here are not exotic:
- opening a fresh connection for every tool call
- dumping huge JSON blobs back into the model context
- treating every tool equally instead of scoping available actions by workflow stage
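The last mistake has a cheap structural fix: advertise only the tools that matter at the current point in the call. A minimal sketch, with hypothetical stage and tool names:

```python
# Map workflow stages to the tools the agent may use at that stage,
# instead of exposing the full catalog on every turn.
TOOLS_BY_STAGE = {
    "identify": ["lookup_customer"],
    "order_support": ["lookup_order", "check_inventory"],
    "payment": ["take_payment"],
}

def tools_for(stage: str) -> list[str]:
    # A smaller tool list shrinks the prompt on every turn and
    # gives the model fewer chances to pick the wrong tool.
    return TOOLS_BY_STAGE.get(stage, [])
```

This improves both latency (less context to re-read) and accuracy (fewer candidate tools per decision).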
Two fixes that move the number materially
1. Keep MCP or tool connections warm
Opening a fresh connection on every call adds avoidable overhead. The better approach is to initialize the tool connection once at session start and reuse it through the whole call.
```python
# Initialize the participant, the session, and the MCP connection
# concurrently at session start, then reuse them for the whole call.
participant, session_id, mcp = await asyncio.gather(
    ctx.wait_for_participant(),
    _init_session(),
    _init_mcp(),
)
```
That kind of change can shave hundreds of milliseconds from every tool call.
2. Trim tool responses before they hit the model
A surprising amount of latency comes from context bloat, not transport.
If your tool returns the full product object, internal IDs, timestamps, image URLs, SEO metadata, tax config, and other irrelevant fields, the model ends up re-reading junk on every later turn.
The fix is straightforward: return only the fields the agent actually needs.
```typescript
// A recursive shape type: mark each field to keep with `true`,
// or nest another shape to trim a sub-object.
export type ResponseShape = { [key: string]: true | ResponseShape };
```
This is one of those changes that improves both latency and tool selection quality at the same time.
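To make the trimming concrete, here is the same idea sketched in Python: a shape maps field names to `True` (keep) or a nested shape, and everything else is dropped before the result reaches the model. The product fields are illustrative:

```python
def project(data: dict, shape: dict) -> dict:
    """Return a copy of `data` containing only the fields named in `shape`."""
    out = {}
    for key, sub in shape.items():
        if key not in data:
            continue
        if sub is True:
            out[key] = data[key]          # keep the field as-is
        else:
            out[key] = project(data[key], sub)  # recurse into sub-objects
    return out

# A bloated tool response, as it might come back from a commerce API.
product = {
    "id": "p_123",
    "name": "Desk Lamp",
    "price": 39.00,
    "seo": {"title": "Buy Desk Lamps Online", "keywords": ["lamp"]},
    "inventory": {"warehouse_id": "w_9", "available": 4},
}

# Keep only what the agent needs for this workflow stage.
trimmed = project(product, {
    "name": True,
    "price": True,
    "inventory": {"available": True},
})
# trimmed == {"name": "Desk Lamp", "price": 39.0,
#             "inventory": {"available": 4}}
```

Run this once at the tool boundary and every later turn gets cheaper, because the model never re-reads the junk.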
The practical takeaway
If you want a voice agent that feels fast in production, optimize the whole path, not the prettiest metric in the vendor dashboard.
The useful questions are:
- What is our actual voice-to-voice latency?
- How much worse does it get after three or four tool calls?
- Which parts of the delay are structural, and which are just sloppy engineering?
- Are we measuring the caller's experience, or only the model's experience?
That is the difference between a demo that sounds fast in a benchmark clip and an agent that still feels sharp on a real call.
If you're building voice systems seriously, this is the layer worth getting honest about.