
Voice AI Latency. What To Measure? A Production Guide.
A production guide to voice-to-voice latency, with real numbers and code from an open-source LiveKit agent
TL;DR: Vendor-reported latency (~400ms) only measures LLM inference.
Real voice-to-voice latency is 1,200-1,800ms - roughly 5x slower than natural conversation.
The biggest levers: persistent MCP connections (saves 400ms per tool call), tool response trimming (3s → 0.6s TTFT), prompt cache warm-up (54% faster first turn), and model-based turn detection (Deepgram Flux over Silero VAD).
This is the kind of explainer I wish we'd had when we first started building voice agents. Not "how to make voice AI faster", but what you should actually be measuring, why the numbers vendors give you are rubbish, and where the latency is really hiding.
Full agent code: GitHub.
Contents
Voice-to-voice latency
End-to-end breakdown
Tool calling latency
5 measurement mistakes
Prompt cache warm-up
Managed platforms vs developer freedom
The latency budget
Voice-to-Voice Latency: The Only Metric That Matters
There's only one latency metric that actually maps to user experience: the time between when a caller stops speaking and when they hear the agent's voice start.
That's it. Voice-to-voice. Everything else is just internal detail.
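To make that concrete, here's a minimal instrumentation sketch. The event names are ours, not from any framework; wire on_user_stopped_speaking to your end-of-utterance event and on_first_agent_audio to the first TTS frame hitting the audio track.

```python
import time

class VoiceToVoiceTimer:
    """Measures stop-of-user-speech to first agent audio, per turn."""

    def __init__(self):
        self.user_stopped_at = None
        self.samples = []  # one voice-to-voice measurement per completed turn

    def on_user_stopped_speaking(self):
        # Fire this from your VAD / turn-detection "end of utterance" event
        self.user_stopped_at = time.monotonic()

    def on_first_agent_audio(self):
        # Fire this when the first TTS audio frame is written to the track
        if self.user_stopped_at is not None:
            self.samples.append(time.monotonic() - self.user_stopped_at)
            self.user_stopped_at = None

timer = VoiceToVoiceTimer()
timer.on_user_stopped_speaking()
timer.on_first_agent_audio()  # records one sample
```

Log the samples per component alongside the total and you get both the user-facing number and the diagnosis in one place.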
Human conversation is tuned to work within a neurologically hardwired window of about 200-300ms - that's when a response feels natural, like the other person is actually listening.
The industry average for voice AI end-to-end latency, meanwhile, is stuck around 1,400-1,700ms. We're roughly 5 times slower than a natural conversation.
And the industry doesn't talk about voice-to-voice latency because no single vendor owns the whole chain.
Your STT provider will give you their numbers, your LLM provider will give you theirs and your TTS provider will give you theirs.
But nobody ever shows you the sum, because the sum is a pretty embarrassing number.
End-to-End Voice Agent Latency Breakdown
Here's our measured production latency breakdown (Deepgram STT, GPT-4.1-mini LLM, Cartesia TTS):

The LLM is the bottleneck. STT + TTS combined are ~500ms.
The LLM alone is 450-700ms on simple turns, and that's with a fast model (GPT-4.1-mini).
With tool calls, it gets a whole lot worse.
Here's the voice agent pipeline config that produces these numbers:
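A sketch of that config, assuming the LiveKit Agents 1.x AgentSession API; plugin constructor arguments are simplified and may differ from the exact production setup.

```python
from livekit.agents import AgentSession
from livekit.plugins import cartesia, deepgram, openai

# Pipeline: Deepgram STT -> GPT-4.1-mini -> Cartesia TTS
session = AgentSession(
    stt=deepgram.STT(model="nova-3"),
    llm=openai.LLM(model="gpt-4.1-mini"),
    tts=cartesia.TTS(),
    # User must speak for at least 1s before it counts as a real interruption
    min_interruption_duration=1.0,
    # Wait 0.5s after speech stops before declaring end-of-turn
    min_endpointing_delay=0.5,
)
```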
Two things to note. min_interruption_duration=1.0 means the user has to speak for at least a second before the agent lets them interrupt; this filters out the false interruptions from mic echo and background noise that we used to get all the time. And min_endpointing_delay=0.5 gives the STT half a second before declaring end-of-turn - it's about balancing responsiveness against cutting people off mid-sentence.
There are more elegant ways to handle this, e.g. the new VAD models LiveKit just released, but that's material for another post; here, let's tackle the basics.
Tool Calling Latency: The Hidden Cost of Useful Voice Agents
The moment your voice agent actually does something useful (whether that's looking up an order, checking on some inventory, or adding items to a shopping cart) your latency budget gets absolutely screwed.
Here's what a single tool call really ends up costing you:

That's an extra 600-1,200ms of tool calling latency tacked on top of the original response time, and it gets worse: without good context compression, these costs compound across the conversation.
A reorder workflow where the agent needs to check order history, look up stock levels, create a shopping cart, and then add some items can end up being 4 or 5 sequential tool calls.
The problem of connection overhead
Our first go at this was pretty rough: we opened a fresh connection to the MCP server for every single tool call.
And that added a whopping 500-800ms of overhead before we'd even started doing any real work.
The solution? Open the connection once when the voice session kicks off and then just hold onto it for all the tool calls.
We set it up at the same time as the rest of the session setup - waiting for the participant, creating the tracking session, and opening the MCP connection all happen concurrently:
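A runnable sketch of that concurrent setup; the three coroutines are asyncio.sleep stand-ins for the real LiveKit participant wait, tracking-session creation, and MCP connect, and all three function names are illustrative.

```python
import asyncio

# Stand-ins for the three setup steps; in the real agent these would be
# the framework's participant wait, our session tracker, and the MCP
# client connect (opened once and reused for every tool call).
async def wait_for_participant():
    await asyncio.sleep(0.05)
    return "participant"

async def create_tracking_session():
    await asyncio.sleep(0.05)
    return "tracking"

async def connect_mcp():
    await asyncio.sleep(0.05)  # handshake happens once per call session
    return "mcp"

async def setup_session():
    # All three run concurrently, so total setup time is the max of the
    # three, not the sum
    participant, tracking, mcp = await asyncio.gather(
        wait_for_participant(),
        create_tracking_session(),
        connect_mcp(),
    )
    return participant, tracking, mcp

result = asyncio.run(setup_session())
```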
500-800ms per tool call → 100-400ms. On a conversation with 5 tool calls, that's 2-3 seconds saved.
The problem of context bloat
But connections aren't the only issue. The responses from the tools are also a latency multiplier. Our e-commerce MCP server just returns the full product object: a massive 20+ fields per variant including all sorts of internal IDs, timestamps, SEO metadata, image URLs, tax info. A single list_products call dumps 3-5KB of JSON into the LLM context.
And after a few tool calls we're starting to see the LLM chew through 10,000+ tokens of raw JSON on every single subsequent turn. So the LLM TTFT goes from being around 500ms to being 3+ seconds.
The fix: trim the responses on the server side before we even send them to the agent. Each tool declares a responseShape, a declarative allowlist of fields we actually want to keep:
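The production server presumably isn't Python (the camelCase responseShape suggests TypeScript), but the idea fits in a few lines either way; tool and field names below are invented for illustration.

```python
# Per-tool allowlists of fields to keep; everything else is dropped
# server-side before the response ever reaches the agent's LLM context.
RESPONSE_SHAPES = {
    "list_products": {"id", "title", "price", "variants"},
}

def trim(tool_name, payload):
    """Apply a tool's declared response shape to its raw payload."""
    shape = RESPONSE_SHAPES.get(tool_name)
    if shape is None:
        return payload  # no shape declared: pass through untouched
    return {k: v for k, v in payload.items() if k in shape}

# A raw product record with internal fields the LLM never needs
full = {
    "id": "p1", "title": "Gris 12x24", "price": 4.99,
    "seo_description": "...", "created_at": "2024-01-01", "tax_code": "T1",
}
trimmed = trim("list_products", full)
```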
Result? The LLM sees a stripped-back version of the full e-commerce object.
66% token reduction. TTFT dropped from 3s+ down to 0.6-1.8s.
The key decision: trim server-side, not agent-side. Having the shape defined per tool in the MCP server means every connected agent gets the trimmed responses automatically - you don't have to repeat the logic in every voice agent, email agent, or chat agent.
The wrong tool problem
Even which tools are available affects latency. More tools in the LLM context = more tokens per turn = slower TTFT. But the bigger issue is accuracy: with all 11 tools available at once, the LLM makes wrong selections.
Concrete example: during a reorder flow, the customer says "I need more of the Gris tiles." The LLM has product IDs from the order history it just retrieved. But it also has list_products available, so it searches the catalog by name instead of using get_product with the exact ID.
Wrong tool, slower, and sometimes returns the wrong product.
The fix is a guardrail that activates based on conversation state, and context is what arms it: when order history comes back, capture the product IDs.
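A hedged sketch of that state-armed guardrail. The tool names match the post, but the class and data shapes are illustrative, not the production code.

```python
# All tools the agent could expose; once exact product IDs are known,
# the catalog search is hidden so the LLM reaches for get_product.
ALL_TOOLS = {"list_products", "get_product", "check_stock",
             "create_cart", "add_to_cart"}

class SessionState:
    def __init__(self):
        self.known_product_ids = {}

    def on_order_history(self, orders):
        # Arm the guardrail: remember exact IDs from the retrieved history
        for order in orders:
            for item in order["items"]:
                self.known_product_ids[item["name"].lower()] = item["product_id"]

    def available_tools(self):
        if self.known_product_ids:
            return ALL_TOOLS - {"list_products"}
        return ALL_TOOLS

state = SessionState()
state.on_order_history([{"items": [{"name": "Gris", "product_id": "prod_123"}]}])
```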
This is a useful pattern to build on: scoping the available toolset to the current workflow stage reduces the number of tools, which in turn reduces the number of tokens and results in faster time to first token and more accurate tool selection - a double win for performance and accuracy.
Masking tool calling latency with fill lines
Even with persistent connections, trimmed responses, and all the other tricks in the book, tool calls still add anywhere from 600ms to 2 seconds. You can't get rid of that, but you can hide it.
The LLM says a short, specific fill line before calling the tool. LiveKit's AgentSession streams the TTS for that fill line while the tool executes in the background - no custom code needed, the framework handles the overlap naturally.
We enforce this entirely through the system prompt:
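As a hypothetical excerpt (the wording here is ours, not lifted from the production prompt), the rules look something like:

```text
Before calling a tool, say exactly one short fill line, then call the tool:
- check_order_history: "Let me pull up your recent orders."
- check_stock: "Give me a second to check stock on that."
- create_cart: "Okay, I'm setting up a cart for you."
Never go silent before a tool call, and never improvise a different fill line.
```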
Each tool has its own specific fill line prescribed in the prompt.
No parallel dispatch at code level.
The fill line pattern is entirely prompt-enforced, and it works because the framework is already streaming TTS before the tool result comes back.
5 Voice AI Latency Measurement Mistakes
Only measuring happy-path turns
Turn 1 runs at about 500ms TTFT. Turn 8 after a reorder workflow runs at 3s or more.
It's context accumulation that's the root cause.
Early turns are fast: the LLM has a system prompt and one user message.
Later turns, after three or four tool calls have added their responses to the conversation history, that's where the latency spikes live.
If you only benchmark the first few turns of a conversation, your numbers look great. Run through a full 5-tool reorder workflow and you'll see what we're really looking at.
Measuring components in isolation
STT vendors test with great audio and native speakers with no background noise. LLM vendors test with short prompts, no tools, no system prompt. TTS vendors test time-to-first-byte with single sentences.
Real-life production doesn't work like that.
Your STT is transcribing audio that's been compressed through a phone codec.
Your LLM has a multi-thousand-token system prompt with a dozen tool definitions and a growing conversation history.
Your TTS is synthesising a response with order numbers, dates, and weird product names like "MSI Dimensions Gris, twelve by twenty four."
Ignoring VAD and turn detection entirely
Voice Activity Detection is the component that decides "the user has finished talking" - and it's the silent latency killer.
Silero VAD (first attempt): purely audio-based silence detection. Couldn't tell a thinking pause from a genuine end of utterance.
Deepgram Nova-3 endpointing (second attempt): tunable through endpointing_ms and utterance_end_ms. Better, but still fundamentally silence-based.
Deepgram Flux (current): a language model that predicts when the user has finished their thought, not just when they've stopped making sound.
If I were building this again, I'd also try LiveKit's adaptive VAD - our tests with it were super promising.
Ignoring regional latency
This one came back to bite us. We ran our agent in the EU while our LLM, STT, and TTS providers were all back in the US.
That meant every turn was paying cross-region latency on 3-4 separate service calls: 300-800ms per turn. Not because our models were slow, but because continents are really, really far apart.
The fix was actually pretty counterintuitive: co-locate the agent with your inference providers, not with the user. LiveKit's global edge network handles the WebRTC connection: a European user connects to a nearby edge node, which relays to the agent. Put the agent in US-East alongside OpenAI and Deepgram, and the user still gets a fast connection.
Testing at ridiculously low concurrency
Your voice agent is rock solid with 5 users at once. What happens when you get 500?
LLM rate limits break first :)
Prompt Cache Warm-Up: 54% Faster First Response
Of everything we tried, this prompt caching trick had the biggest impact for the least effort.
When a caller connects, the greeting plays for a couple of seconds - "Hey, thanks for calling BuildPro, this is Sam." That's dead time for the LLM. We fire a throwaway max_tokens=1 call with the full system prompt during this window.
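A minimal sketch of the overlap, with asyncio.sleep standing in for the real greeting playback and the 1-token LLM call; function names are illustrative.

```python
import asyncio

async def say_greeting():
    # Stands in for streaming "Hey, thanks for calling BuildPro..." via TTS
    await asyncio.sleep(0.05)
    return "greeted"

async def warm_prompt_cache():
    # Stands in for a throwaway completion with the full system prompt and
    # max_tokens=1, so the provider's prompt cache is hot for the first
    # real turn
    await asyncio.sleep(0.05)
    return "warmed"

async def start_call():
    # The warm-up rides inside the greeting's dead time; it adds no
    # user-perceived latency
    return await asyncio.gather(say_greeting(), warm_prompt_cache())

results = asyncio.run(start_call())
```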
54% time-to-first-token improvement on the first real turn in our initial tests, and 37% in later measurements.
That first turn is the user's first impression of whether the agent is responsive - and it costs us essentially nothing.
Why Per-Turn Control Matters: LiveKit / Pipecat vs Managed Platforms
Everything in this post - persistent MCP connections, declarative response trimming, per-stage tool guardrails, prompt cache warm-up, sentence tokenizer tuning - requires per-turn control over the conversation pipeline.
Take the asyncio.gather that sets up the session (participant, tracking session, and MCP connection in parallel), and the one that overlaps the greeting with the prompt cache warm-up.
These aren't advanced patterns - they're just standard Python asyncio. But to use them, you have to own the session lifecycle. You need to intercept what the LLM sees, modify tool availability mid-conversation, inject session state the LLM doesn't manage directly, and react to tool results before they even reach the context.
That's what open frameworks like LiveKit Agents and Pipecat give you. Every turn goes through your code.
Managed platforms (Vapi, Retell, Bland) don't give you this level of control. You define the prompt, hook up the tools, and the platform manages the whole thing. That works fine for simple Q&A.
But when you start hitting real workflow complexity - blocking tools after a specific event, trimming responses per-tool, warming caches during dead time - that's the platform's ceiling.
The tradeoff: you build and maintain the orchestration yourself. At low volume with simple use cases, the managed overhead doesn't matter. At scale with complex workflows, per-turn control isn't a nice-to-have; it's the difference between 1.2s and 3s+ voice-to-voice latency.
Start Measuring the Right Thing
Here's how we approach it: set a total end-to-end latency budget for voice-to-voice responses and allocate it across the voice agent pipeline.
The Sub-Second Voice AI Latency Budget

If one thing should stick from this whole post, it's this: instrument voice-to-voice latency as your primary metric, broken down by component.
Not LLM time to first token. Not STT word error rate. Not TTS MOS score. Those all matter, but they're components. The number your caller actually experiences is the end-to-end voice-to-voice delay.
Everything else is diagnosis.
The full agent code from this post is on GitHub - MIT licensed, with all the optimisations discussed here.