
Voice AI Latency. What To Measure? A Production Guide.
A production guide to voice-to-voice latency, with real numbers and code from an open-source LiveKit agent
TL;DR: Vendor-reported latency (~400ms) only measures LLM inference.
Real voice-to-voice latency is 1,200-1,800ms - roughly 5x slower than natural conversation.
The biggest levers: persistent MCP connections (saves 400ms per tool call), tool response trimming (3s → 0.6s TTFT), prompt cache warm-up (54% faster first turn), and model-based turn detection (Deepgram Flux over Silero VAD).
This is the kind of explainer I wish we'd had when we first started building voice agents. Not "how to make voice AI faster", but what you should actually be measuring, why the numbers vendors give you are rubbish, and where the latency is really hiding.
Full agent code: GitHub.
Contents
Voice-to-voice latency
End-to-end breakdown
Tool calling latency
5 measurement mistakes
Prompt cache warm-up
Managed platforms vs developer freedom
The latency budget
Voice-to-Voice Latency: The Only Metric That Matters
There's only one latency metric that actually maps to user experience: the time between when a caller stops speaking and when they hear the agent's voice start.
That's it. Voice-to-voice. Everything else is just internal detail.
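To make that concrete, here's a minimal instrumentation sketch. The event names are ours, not from any framework; wire on_user_stopped_speaking to your end-of-utterance event and on_first_agent_audio to the first TTS frame hitting the audio track.

```python
import time

class VoiceToVoiceTimer:
    """Measures stop-of-user-speech to first agent audio, per turn."""

    def __init__(self):
        self.user_stopped_at = None
        self.samples = []  # one voice-to-voice measurement per completed turn

    def on_user_stopped_speaking(self):
        # Fire this from your VAD / turn-detection "end of utterance" event
        self.user_stopped_at = time.monotonic()

    def on_first_agent_audio(self):
        # Fire this when the first TTS audio frame is written to the track
        if self.user_stopped_at is not None:
            self.samples.append(time.monotonic() - self.user_stopped_at)
            self.user_stopped_at = None

timer = VoiceToVoiceTimer()
timer.on_user_stopped_speaking()
timer.on_first_agent_audio()  # records one sample
```

Log the samples per component alongside the total and you get both the user-facing number and the diagnosis in one place.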
Human conversation is tuned to work within a neurologically hardwired window of about 200-300ms - that's when a response feels natural, like the other person is actually listening.
The industry average for voice AI end-to-end latency, meanwhile, is stuck around 1,400-1,700ms. We're roughly 5 times slower than a natural conversation.
And the industry doesn't talk about voice-to-voice latency because no single vendor owns the whole chain.
Your STT provider will give you their numbers, your LLM provider will give you theirs and your TTS provider will give you theirs.
But nobody ever shows you the sum, because the sum is a pretty embarrassing number.
End-to-End Voice Agent Latency Breakdown
Here's our measured production latency breakdown (Deepgram STT, GPT-4.1-mini LLM, Cartesia TTS):

The LLM is the bottleneck. STT + TTS combined are ~500ms.
The LLM alone is 450-700ms on simple turns, and that's with a fast model (GPT-4.1-mini).
With tool calls, it gets a whole lot worse.
Here's the voice agent pipeline config that produces these numbers:
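A sketch of that config, assuming the LiveKit Agents 1.x AgentSession API; plugin constructor arguments are simplified and may differ from the exact production setup.

```python
from livekit.agents import AgentSession
from livekit.plugins import cartesia, deepgram, openai

# Pipeline: Deepgram STT -> GPT-4.1-mini -> Cartesia TTS
session = AgentSession(
    stt=deepgram.STT(model="nova-3"),
    llm=openai.LLM(model="gpt-4.1-mini"),
    tts=cartesia.TTS(),
    # User must speak for at least 1s before it counts as a real interruption
    min_interruption_duration=1.0,
    # Wait 0.5s after speech stops before declaring end-of-turn
    min_endpointing_delay=0.5,
)
```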
Two things to note. min_interruption_duration=1.0 means the user has to speak for at least a second before the agent lets them interrupt; this filters out the false interruptions from mic echo and background noise that we used to get all the time. And min_endpointing_delay=0.5 gives the STT half a second before declaring end-of-turn - it's about balancing responsiveness against cutting people off mid-sentence.
There are more elegant ways to handle this, e.g. the new VAD models LiveKit just released, but that's material for another post; here, let's tackle the basics.
Tool Calling Latency: The Hidden Cost of Useful Voice Agents
The moment your voice agent actually does something useful (whether that's looking up an order, checking on some inventory, or adding items to a shopping cart) your latency budget gets absolutely screwed.
Here's what a single tool call really ends up costing you:

That's an extra 600-1,200ms of tool calling latency tacked on top of the original response time, and it gets worse: without good context compression, these costs compound across the conversation.
A reorder workflow where the agent needs to check order history, look up stock levels, create a shopping cart, and then add some items can end up being 4 or 5 sequential tool calls.
The problem of connection overhead
Our first go at this was pretty rough: we opened a fresh connection to the MCP server for every single tool call.
And that added a whopping 500-800ms of overhead before we'd even started doing any real work.
The solution? Open the connection once when the voice session kicks off and then just hold onto it for all the tool calls.
We set it up at the same time as the rest of the session setup - waiting for the participant, creating the tracking session, and opening the MCP connection all happen concurrently:
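A runnable sketch of that concurrent setup; the three coroutines are asyncio.sleep stand-ins for the real LiveKit participant wait, tracking-session creation, and MCP connect, and all three function names are illustrative.

```python
import asyncio

# Stand-ins for the three setup steps; in the real agent these would be
# the framework's participant wait, our session tracker, and the MCP
# client connect (opened once and reused for every tool call).
async def wait_for_participant():
    await asyncio.sleep(0.05)
    return "participant"

async def create_tracking_session():
    await asyncio.sleep(0.05)
    return "tracking"

async def connect_mcp():
    await asyncio.sleep(0.05)  # handshake happens once per call session
    return "mcp"

async def setup_session():
    # All three run concurrently, so total setup time is the max of the
    # three, not the sum
    participant, tracking, mcp = await asyncio.gather(
        wait_for_participant(),
        create_tracking_session(),
        connect_mcp(),
    )
    return participant, tracking, mcp

result = asyncio.run(setup_session())
```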
500-800ms per tool call → 100-400ms. On a conversation with 5 tool calls, that's 2-3 seconds saved.
The problem of context bloat
But connections aren't the only issue. The responses from the tools are also a latency multiplier. Our e-commerce MCP server just returns the full product object: a massive 20+ fields per variant including all sorts of internal IDs, timestamps, SEO metadata, image URLs, tax info. A single list_products call dumps 3-5KB of JSON into the LLM context.
And after a few tool calls we're starting to see the LLM chew through 10,000+ tokens of raw JSON on every single subsequent turn. So the LLM TTFT goes from being around 500ms to being 3+ seconds.
The fix: trim the responses on the server side before we even send them to the agent. Each tool declares a responseShape, a declarative allowlist of fields we actually want to keep:
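The production server presumably isn't Python (the camelCase responseShape suggests TypeScript), but the idea fits in a few lines either way; tool and field names below are invented for illustration.

```python
# Per-tool allowlists of fields to keep; everything else is dropped
# server-side before the response ever reaches the agent's LLM context.
RESPONSE_SHAPES = {
    "list_products": {"id", "title", "price", "variants"},
}

def trim(tool_name, payload):
    """Apply a tool's declared response shape to its raw payload."""
    shape = RESPONSE_SHAPES.get(tool_name)
    if shape is None:
        return payload  # no shape declared: pass through untouched
    return {k: v for k, v in payload.items() if k in shape}

# A raw product record with internal fields the LLM never needs
full = {
    "id": "p1", "title": "Gris 12x24", "price": 4.99,
    "seo_description": "...", "created_at": "2024-01-01", "tax_code": "T1",
}
trimmed = trim("list_products", full)
```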
Result? The LLM sees a stripped-back version of the full e-commerce object.
66% token reduction. TTFT dropped from 3s+ down to 0.6-1.8s.
The key decision: trim server-side, not agent-side. Having the shape defined per tool in the MCP server means every connected agent gets the trimmed responses automatically - you don't have to repeat the logic in every voice agent, email agent, or chat agent.
The wrong tool problem
Even which tools are available affects latency. More tools in the LLM context = more tokens per turn = slower TTFT. But the bigger issue is accuracy: with all 11 tools available at once, the LLM makes wrong selections.
Concrete example: during a reorder flow, the customer says "I need more of the Gris tiles." The LLM has product IDs from the order history it just retrieved. But it also has list_products available, so it searches the catalog by name instead of using get_product with the exact ID.
Wrong tool, slower, and sometimes returns the wrong product.
The fix is a guardrail that activates based on conversation state, and context is what arms it: when order history comes back, capture the product IDs.
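A hedged sketch of that state-armed guardrail. The tool names match the post, but the class and data shapes are illustrative, not the production code.

```python
# All tools the agent could expose; once exact product IDs are known,
# the catalog search is hidden so the LLM reaches for get_product.
ALL_TOOLS = {"list_products", "get_product", "check_stock",
             "create_cart", "add_to_cart"}

class SessionState:
    def __init__(self):
        self.known_product_ids = {}

    def on_order_history(self, orders):
        # Arm the guardrail: remember exact IDs from the retrieved history
        for order in orders:
            for item in order["items"]:
                self.known_product_ids[item["name"].lower()] = item["product_id"]

    def available_tools(self):
        if self.known_product_ids:
            return ALL_TOOLS - {"list_products"}
        return ALL_TOOLS

state = SessionState()
state.on_order_history([{"items": [{"name": "Gris", "product_id": "prod_123"}]}])
```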
This is a useful pattern to build on: scoping the available toolset to the current workflow stage reduces the number of tools, which in turn reduces the number of tokens and results in faster time to first token and more accurate tool selection - a double win for performance and accuracy.
Masking tool calling latency with fill lines
Even with persistent connections, trimmed responses, and all the other tricks in the book, tool calls still add anywhere from 600ms to 2 seconds. You can't get rid of that, but you can hide it.
The LLM says a short, specific fill line before calling the tool. LiveKit's AgentSession streams the TTS for that fill line while the tool executes in the background - no custom code needed, the framework handles the overlap naturally.
We enforce this entirely through the system prompt:
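As a hypothetical excerpt (the wording here is ours, not lifted from the production prompt), the rules look something like:

```text
Before calling a tool, say exactly one short fill line, then call the tool:
- check_order_history: "Let me pull up your recent orders."
- check_stock: "Give me a second to check stock on that."
- create_cart: "Okay, I'm setting up a cart for you."
Never go silent before a tool call, and never improvise a different fill line.
```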
Each tool has its own specific fill line prescribed in the prompt.
No parallel dispatch at code level.
The fill line pattern is entirely prompt-enforced, and it works because the framework is already streaming TTS before the tool result comes back.
5 Voice AI Latency Measurement Mistakes
Only measuring happy-path turns
Turn 1 runs at about 500ms TTFT. Turn 8 after a reorder workflow runs at 3s or more.
It's context accumulation that's the root cause.
Early turns are fast: the LLM has a system prompt and one user message.
Later turns, after three or four tool calls have added their responses to the conversation history, that's where the latency spikes live.
If you only benchmark the first few turns of a conversation, your numbers look great. Run through a full 5-tool reorder workflow and you'll see what we're really looking at.
Measuring components in isolation
STT vendors test with great audio and native speakers with no background noise. LLM vendors test with short prompts, no tools, no system prompt. TTS vendors test time-to-first-byte with single sentences.
Real-life production doesn't work like that.
Your STT is transcribing audio that's been compressed through a phone codec.
Your LLM has a multi-thousand-token system prompt with a dozen tool definitions and a growing conversation history.
Your TTS is synthesising a response with order numbers, dates, and weird product names like "MSI Dimensions Gris, twelve by twenty four."
Ignoring VAD and turn detection entirely
Voice Activity Detection is the component that decides "the user has finished talking" - and it's the silent latency killer.
Silero VAD (first attempt): purely audio-based silence detection. Couldn't tell a thinking pause from a genuine end of utterance.
Deepgram Nova-3 endpointing (second attempt): tunable through endpointing_ms and utterance_end_ms. Better, but still fundamentally silence-based.
Deepgram Flux (current): a language model that predicts when the user has finished their thought, not just when they've stopped making sound.
If I were building this again, I'd also try LiveKit's adaptive VAD - our tests with it were super promising.
Ignoring regional latency
This one came back to bite us. We ran our agent in the EU while our LLM, STT, and TTS providers were all back in the US.
That meant every turn was paying cross-region latency on 3-4 separate service calls: 300-800ms per turn. Not because our models were slow, but because continents are really, really far apart.
The fix was actually pretty counterintuitive: co-locate the agent with your inference providers, not with the user. LiveKit's global edge network handles the WebRTC connection: a European user connects to a nearby edge node, which relays to the agent. Put the agent in US-East alongside OpenAI and Deepgram, and the user still gets a fast connection.
Testing at ridiculously low concurrency
Your voice agent is rock solid with 5 users at once. What happens when you get 500?
LLM rate limits break first :)
Prompt Cache Warm-Up: 54% Faster First Response
Of everything we tried, this prompt caching trick had the biggest impact for the least effort.
When a caller connects, the greeting plays for a couple of seconds - "Hey, thanks for calling BuildPro, this is Sam." That's dead time for the LLM. We fire a throwaway max_tokens=1 call with the full system prompt during this window.
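A minimal sketch of the overlap, with asyncio.sleep standing in for the real greeting playback and the 1-token LLM call; function names are illustrative.

```python
import asyncio

async def say_greeting():
    # Stands in for streaming "Hey, thanks for calling BuildPro..." via TTS
    await asyncio.sleep(0.05)
    return "greeted"

async def warm_prompt_cache():
    # Stands in for a throwaway completion with the full system prompt and
    # max_tokens=1, so the provider's prompt cache is hot for the first
    # real turn
    await asyncio.sleep(0.05)
    return "warmed"

async def start_call():
    # The warm-up rides inside the greeting's dead time; it adds no
    # user-perceived latency
    return await asyncio.gather(say_greeting(), warm_prompt_cache())

results = asyncio.run(start_call())
```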
54% time-to-first-token improvement on the first real turn in our initial tests, and 37% in later measurements.
That first turn is the user's first impression of whether the agent is responsive - and it costs us essentially nothing.
Why Per-Turn Control Matters: LiveKit / Pipecat vs Managed Platforms
Everything in this post - persistent MCP connections, declarative response trimming, per-stage tool guardrails, prompt cache warm-up, sentence tokenizer tuning - requires per-turn control over the conversation pipeline.
Take the asyncio.gather that sets up the session (participant, tracking session, and MCP connection in parallel), and the one that overlaps the greeting with the prompt cache warm-up.
These aren't advanced patterns - they're just standard Python asyncio. But to use them, you have to own the session lifecycle. You need to intercept what the LLM sees, modify tool availability mid-conversation, inject session state the LLM doesn't manage directly, and react to tool results before they even reach the context.
That's what open frameworks like LiveKit Agents and Pipecat give you. Every turn goes through your code.
Managed platforms (Vapi, Retell, Bland) don't give you this level of control. You define the prompt, hook up the tools, and the platform manages the whole thing. That works fine for simple Q&A.
But when you start hitting real workflow complexity - blocking tools after a specific event, trimming responses per-tool, warming caches during dead time - that's the platform's ceiling.
The tradeoff: you build and maintain the orchestration yourself. At low volume with simple use cases, the managed overhead doesn't matter. At scale with complex workflows, per-turn control isn't a nice-to-have; it's the difference between 1.2s and 3s+ voice-to-voice latency.
Start Measuring the Right Thing
Here's how we approach it: set a total end-to-end latency budget for voice-to-voice responses and allocate it across the voice agent pipeline.
The Sub-Second Voice AI Latency Budget

If one thing should stick from this whole post, it's this: instrument voice-to-voice latency as your primary metric, broken down by component.
Not LLM time to first token. Not STT word error rate. Not TTS MOS score. Those all matter, but they're components. The number your caller actually experiences is the end-to-end voice-to-voice delay.
Everything else is diagnosis.
The full agent code from this post is on GitHub - MIT licensed, with all the optimisations discussed here.