Why Your Voice AI Says It Can't Talk (And How to Fix It)

A visual explainer of cascaded voice AI architecture, why LLMs deny they can hear you, and how prompts, latency, and turn-taking shape the real experience.

The visual explainer I wish I had earlier

It's common for stakeholders to get confused about voice AI architecture.

What does it even mean that you "build the foundation in chat, then add the voice agent layer"?

I tried to visualize it in the simplest way possible. This is the explainer I wish existed when I started.

Follow along and you'll see exactly how audio becomes text, how thought becomes speech, where the LLM sits blind in the middle, and why one line in your system prompt changes everything.

Whether you're building or buying, this is how voice AI actually works under the hood.

The WTF moment in voice AI architecture

You just built your first voice AI agent. Microphone working. Speaker working. You can hear it talking.

So, as a customer interacting with the agent, you ask the obvious question:

"Hey, can you hear me?"

And the AI says:

"I can see your text message, but I should clarify that I'm a text-based AI assistant. I can't actually hear audio or voice. I can only read and respond to the text you type to me."

It said this out loud. With its voice. While literally hearing you speak.

What the hell?

The classic failure mode: a voice agent speaking out loud while insisting it is text-only.

Here's what's actually happening.

When you build a voice AI system, you're usually assembling a chain of specialized models. Each one does a different job:

Your voice -> [Microphone] -> [Speech-to-Text]
-> [LLM Brain] -> [Text-to-Speech] -> [Speaker]

This is called a cascaded architecture.

In 2026, most production voice systems still use this setup, even though unified speech-to-speech models are getting better. Direct speech-to-speech is promising, but chaining specialized models is still the standard when you care about controllability and accuracy.

Why?

  • You can apply tighter guardrails in the prompt and orchestration layer.
  • You can constrain the agent to a defined knowledge base.
  • You can choose different models for latency, quality, and cost at each step.

The tradeoff is latency. The user's speech gets transcribed, processed by a language model, then synthesized back into audio. Every layer adds time.

The LLM sits in the middle. It receives text. It outputs text. That's it.

It has no native awareness:

  • that your voice was transcribed into the text it is reading
  • that its response will be spoken aloud
  • that it is part of a real-time conversation
  • that it can, in effect, hear and speak through the surrounding pipeline

The LLM is a brain in a jar. You gave it ears with speech-to-text and a mouth with text-to-speech, but nobody told the brain.
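The brain-in-a-jar idea fits in a few lines of Python. The `stt`, `llm`, and `tts` functions below are fakes standing in for the real services; only the shape of the data flow matters.

```python
# Conceptual sketch of a cascaded pipeline. Each stage is a black box
# to the next one; the middle stage never sees audio.

def stt(audio: bytes) -> str:
    """Stand-in for a speech-to-text service (e.g. Deepgram)."""
    return "hello can you hear me"

def llm(text: str) -> str:
    """Stand-in for the language model: it only ever sees text."""
    return f"You said: {text}"

def tts(text: str) -> bytes:
    """Stand-in for text-to-speech (e.g. ElevenLabs)."""
    return text.encode("utf-8")  # pretend this is audio

def cascaded_turn(audio_in: bytes) -> bytes:
    """One conversational turn. The LLM in the middle never touches
    audio on either side of the pipeline."""
    return tts(llm(stt(audio_in)))
```

Notice that `llm` has no way to know its input was ever spoken or that its output will be; that information has to come from somewhere else, which is where the system prompt comes in.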

The one-line fix: tell the model what it is

The solution is almost embarrassingly simple. You need to tell your AI what it is.

Before:

You are a helpful assistant.

After:

You are a voice assistant powered by Pipecat.
You ARE having a real-time voice conversation with the user.
Their speech is transcribed and sent to you. Your responses are spoken aloud.
Keep responses brief and conversational - you're talking, not writing.
Don't say you can't hear - you can, through speech-to-text.

Same question. Completely different response.

"Yes, I can hear you perfectly. And yes, I'm talking to you right now. How are you doing today?"

That's it. The system prompt is your AI's self-awareness.

Same pipeline, different prompt, dramatically better behavior.
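In an OpenAI-style messages format, that self-awareness is just the first entry in the model's context. The exact wording below is illustrative, not a canonical prompt:

```python
# The system message is where the "self-awareness" line lives.
# Everything else in the pipeline stays exactly the same.

VOICE_SYSTEM_PROMPT = (
    "You are a voice assistant. "
    "The user's speech is transcribed and sent to you; "
    "your responses are spoken aloud via text-to-speech. "
    "Keep responses brief and conversational. "
    "Never claim you cannot hear - you can, through speech-to-text."
)

messages = [
    {"role": "system", "content": VOICE_SYSTEM_PROMPT},
    # The context aggregator appends transcribed user speech here:
    {"role": "user", "content": "Hey, can you hear me?"},
]
```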

What the full stack actually looks like

Now that the why is clear, here's the how.

┌─────────────────────────────────────────────────┐
│                     YOU                         │
│              Speaking into mic                  │
└────────────────────────┬────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│                  TRANSPORT                      │
│                  (WebRTC)                       │
│                                                 │
│  The "phone line" - moves audio, understands    │
│  nothing                                        │
└────────────────────────┬────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│                    VAD                          │
│         (Voice Activity Detection)              │
│                                                 │
│  "Is someone speaking, or is this just noise?"  │
│  Runs locally, <1ms, no API calls               │
└────────────────────────┬────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│                    STT                          │
│         (Speech-to-Text: Deepgram)              │
│                                                 │
│  THE EARS - converts audio to text              │
│  "Hello can you hear me" -> TranscriptionFrame  │
└────────────────────────┬────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│            CONTEXT AGGREGATOR                   │
│                                                 │
│  Collects conversation history and formats it   │
│  for the LLM. Tracks what the user said and     │
│  what the bot said.                             │
└────────────────────────┬────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│                    LLM                          │
│            (The Brain: Claude)                  │
│                                                 │
│  THE ONLY PART THAT "THINKS"                    │
│  Receives text -> generates text response       │
│  Has NO IDEA it's in a voice pipeline           │
│  unless you tell it                             │
└────────────────────────┬────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│                    TTS                          │
│        (Text-to-Speech: ElevenLabs)             │
│                                                 │
│  THE MOUTH - converts text to audio             │
│  Can stream word-by-word instead of waiting     │
│  for the full response                          │
└────────────────────────┬────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│                  TRANSPORT                      │
│               (output side)                     │
│                                                 │
│  Sends audio back through WebRTC to your        │
│  speaker                                        │
└────────────────────────┬────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│                     YOU                         │
│            Hearing the response                 │
└─────────────────────────────────────────────────┘

The key insight is simple: the LLM only sees the middle. It gets text in and sends text out. The audio capture, transcription, and synthesis all happen around it.

Latency is a conversation killer

Each component adds delay:

VAD       <1ms (local)
STT       100-300ms
LLM       500-2000ms
TTS       100-500ms
Network   50-200ms

Total: 750ms - 3 seconds before the user hears a response

This is why:

  • token streaming matters, because TTS can start before the LLM is fully done
  • model choice matters, because voice favors fast responses over essay-quality output
  • TTS choice matters, because synthesis startup time changes the whole feel of the call
  • response length matters, because shorter replies sound faster even when the models stay the same
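The streaming point is worth making concrete. A back-of-envelope sketch, using the ranges from the table above plus two assumed numbers (time to first tokens, TTS startup):

```python
# Why token streaming cuts perceived latency. Values are illustrative,
# taken from the ranges above; LLM_FIRST_TOKEN_MS and TTS_START_MS are
# assumptions for the sake of the arithmetic.

STT_MS = 300
LLM_TOTAL_MS = 2000        # time to generate the full response
LLM_FIRST_TOKEN_MS = 400   # assumed time until the first few tokens
TTS_START_MS = 200         # assumed synthesis startup time
NETWORK_MS = 200

# Without streaming: wait for the whole LLM response, then synthesize.
blocking_ms = STT_MS + LLM_TOTAL_MS + TTS_START_MS + NETWORK_MS

# With streaming: TTS starts as soon as the first tokens arrive.
streaming_ms = STT_MS + LLM_FIRST_TOKEN_MS + TTS_START_MS + NETWORK_MS

print(blocking_ms, streaming_ms)  # 2700 vs 1100 in this example
```

Same models, same total generation time, but the user starts hearing audio well over a second sooner.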

Turn-taking is what makes voice feel natural

The best conversations flow. You talk, the other person listens, then they jump in at the right time. No awkward overlap. No weird silence.

That is exactly what strong voice agents are trying to reproduce.

For that to work, the system needs to know when you've finished speaking. That is where turn-taking comes in.

Get turn-taking right, and the agent feels almost human. Perceived latency drops, interruptions feel natural, and users stop fighting the system.

Get it wrong, and the whole thing feels robotic.

The traditional approach: VAD and its limits

Most voice pipelines use a lightweight model that listens for silence. When you stop making sound, it assumes you're done talking.

The problem is that silence is not the same thing as completion.

If you say, "I need to book a flight to..." and pause to think, basic VAD may decide the turn is over and fire the response too early. Now the agent interrupts you, you're both talking at once, and the conversation breaks.

Traditional stacks try to patch this with extra layers: VAD for silence, another model for endpointing, maybe a timeout buffer on top. It works sometimes, but it is fragile and demands constant tuning.
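Silence-based endpointing really is this simple, which is why it fails on thinking pauses. A minimal sketch (frame sizes and the threshold are illustrative):

```python
# Naive silence-based endpointing: declare the turn over after N
# consecutive silent frames, regardless of what was said.

SILENCE_FRAMES_TO_END_TURN = 25  # e.g. 25 x 20ms frames = 500ms of silence

def detect_turn_end(frames, is_speech):
    """frames: iterable of audio frames; is_speech: a VAD classifier.
    Returns True once the silence threshold is hit."""
    silent = 0
    for frame in frames:
        silent = 0 if is_speech(frame) else silent + 1
        if silent >= SILENCE_FRAMES_TO_END_TURN:
            return True  # fires even on a mid-sentence thinking pause
    return False
```

Nothing in this loop knows whether the last sentence was finished; a 500ms pause after "I need to book a flight to..." looks identical to a 500ms pause after "Thanks, goodbye."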

How Deepgram Flux changes the tradeoff

Deepgram Flux takes a different approach.

Instead of bolting turn detection onto transcription, it fuses them together. The same model that transcribes the words is also modeling conversational flow.

That means it can combine:

  • acoustic signals like prosody, pause duration, and rhythm
  • semantic signals like sentence completion, intent, and conversational structure

So when you say, "I was thinking about..." and trail off, the system can infer that the thought is incomplete even if there is silence. But when you say, "Thanks so much," with a closed sentence and falling intonation, it can infer that the turn is done.
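To make the idea concrete, here is a toy scoring function combining the two signal families. The features and weights are invented for illustration; Flux learns this jointly inside one model rather than applying hand-tuned rules like these.

```python
# Illustrative only: combining acoustic and semantic evidence into a
# single end-of-turn score. Higher means "more likely the turn is over".
# Features and weights are made up for the sketch.

def turn_end_score(pause_ms: float,
                   falling_pitch: bool,
                   sentence_complete: bool) -> float:
    score = min(pause_ms / 1000.0, 1.0) * 0.4  # acoustic: pause length
    score += 0.2 if falling_pitch else 0.0     # acoustic: intonation
    score += 0.4 if sentence_complete else 0.0 # semantic: completion
    return score

# "I was thinking about..." - long pause, but the sentence is open,
# so the score stays low despite the silence.
hold = turn_end_score(800, falling_pitch=False, sentence_complete=False)

# "Thanks so much." - short pause, closed sentence, falling intonation,
# so the score is high even with little silence.
done = turn_end_score(300, falling_pitch=True, sentence_complete=True)
```

The point of the sketch: pause duration alone cannot separate the two cases, but adding the semantic signal can.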

I'm bullish on this right now and testing it heavily.

If you want voice AI to sound natural, smooth turn-taking is not optional. It is one of the biggest differences between a useful conversational agent and a demo that feels off after thirty seconds.

Voice AI is five technologies, not one

Voice AI is not one model. It is five systems working together:

  1. WebRTC for transport
  2. VAD for detecting speech
  3. STT for hearing
  4. LLM for thinking
  5. TTS for speaking

The LLM is the star, but it is also the most clueless component in the stack. It does not know where the input came from or where the output is going. It thinks it is typing in a chat window.

Your job as a builder is not just to connect the pieces. It is to give the brain a coherent sense of self.

The system prompt is not just a set of instructions. It is the AI's understanding of its own existence.

How Pipecat handles the plumbing

Pipecat is an open-source framework that orchestrates exactly this kind of pipeline.

Instead of manually wiring together WebRTC, VAD, STT, LLM, and TTS, then debugging all the async behavior between them, Pipecat lets you define the flow declaratively:

pipeline = Pipeline(
    [
        transport.input(),               # Audio in (WebRTC)
        stt,                             # Ears: Speech -> Text
        transcript.user(),               # Log what the user said
        context_aggregator.user(),       # Add to conversation history
        llm,                             # Brain: Think
        tts,                             # Mouth: Text -> Speech
        transport.output(),              # Audio out (WebRTC)
        transcript.assistant(),          # Log what the bot said
        context_aggregator.assistant(),  # Track the full conversation
    ]
)

Read that top to bottom. That is literally the order the signal moves through the system.

Each component is a processor that takes frames in and pushes frames out. Pipecat handles the streaming, the async coordination, and the interruption logic.

Your job is to configure the components and write the system prompt. The framework handles the plumbing.
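The frames-in, frames-out idea can be modeled in a few lines. Real Pipecat processors are async, handle many frame types, and track direction; this toy keeps only the push-based shape described above:

```python
# Toy model of "frames in, frames out" plumbing. Each processor
# handles a frame and pushes it to the next one, so the link order
# is the signal path - just like the Pipeline list above.

class Processor:
    def __init__(self):
        self._next = None

    def link(self, nxt):
        """Chain the next processor; returns it so links compose."""
        self._next = nxt
        return nxt

    def push(self, frame):
        if self._next:
            self._next.process(frame)

    def process(self, frame):
        self.push(frame)  # default: pass frames through untouched

class Logger(Processor):
    """Records which stage saw the frame, then passes it along."""
    def __init__(self, name, log):
        super().__init__()
        self.name, self.log = name, log

    def process(self, frame):
        self.log.append((self.name, frame))
        self.push(frame)

log = []
head = Logger("stt", log)
head.link(Logger("llm", log)).link(Logger("tts", log))
head.process("hello")
# log now shows the frame visiting stt, then llm, then tts, in order
```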

The foundation

Once this foundation is in place, you can add:

  • function calling so the agent can check weather, look up orders, or book appointments
  • interruption tuning so the agent responds cleanly in messy real conversations
  • custom voices with providers like ElevenLabs
  • phone integrations by swapping WebRTC for Twilio or other telephony layers

But all of it builds on the same core idea:

The LLM is just the brain, and you have to tell it about its body.