A hands-on production tutorial covering the full STT to LLM to TTS pipeline, the 800ms latency budget, barge-in, Twilio telephony, and the Pipecat versus LiveKit versus Vapi decision.
What a voice AI agent actually is in 2026
A voice AI agent is a real-time loop that listens to a caller, transcribes their speech, reasons over it with an LLM, and speaks a reply back fast enough to feel like a conversation. In 2026 the dominant pattern is still the cascaded pipeline: speech-to-text (STT) feeds a language model (LLM), whose tokens stream into text-to-speech (TTS). LiveKit reports that roughly 90% of production agents on its platform still use this three-vendor cascade rather than a single speech-native model, because the cascade lets you swap the best component at each layer.
The hard part is not intelligence; it is timing. A human conversation has a turn-taking gap of around 200ms. Push the agent’s response past about 800ms and the interaction starts to feel like a bad VoIP call: people repeat themselves, talk over the bot, and hang up. Everything in this tutorial is organized around protecting that budget.
There are two architectural families. The cascade (STT to LLM to TTS) is modular, debuggable, and lets you pin a specific model at each stage. The speech-to-speech family (OpenAI Realtime, Gemini Live) collapses the pipeline into one model for lower latency and richer prosody, but you trade away per-layer control, function-calling reliability, and the ability to log a clean transcript. For most teams shipping a real product, the cascade is the right default in 2026, so that is what we build.

The latency budget: where every millisecond goes
| Metric | Value | Context |
|---|---|---|
| Natural conversation budget | <800ms | End of user speech to first agent audio |
| STT finalization | 150-300ms | Deepgram Nova-3 streaming range |
| TTS time-to-first-audio | 75-200ms | Cartesia Sonic-3 to ElevenLabs Flash v2.5 |
| Human turn-taking gap | ~200ms | The bar you are trying to match |
The target for a natural-feeling voice AI agent is under 800ms from the end of the user’s speech to the first audio of the reply, with sub-500ms considered the gold standard. You cannot hit that by luck; you hit it by assigning a millisecond budget to each stage and measuring against it. LiveKit’s own component targets are STT under 200ms, LLM time-to-first-token under 300ms, and TTS time-to-first-audio under 300ms.
The single biggest hidden cost is end-of-turn detection. A naive voice-activity-detection (VAD) silence timeout of 800ms adds nearly a full second to every response before the pipeline even starts processing. That is why turn detection (covered below) is the highest-leverage thing you can tune. The second biggest cost is choosing a batch-mode STT or TTS provider that cannot stream; streaming is non-negotiable in the hot path.
Think of the budget as a stacked bar. Network and VAD eat the first ~100ms, STT finalization another 150-300ms, the LLM’s first token 200-400ms, and TTS first audio 75-200ms. Anything that is not streaming, or any extra network hop, blows the bar past 800ms.
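The stacked-bar idea can be checked with simple arithmetic. The sketch below sums the per-stage ranges quoted in the paragraph above; the dictionary keys and function name are illustrative, not part of any framework.

```python
# Per-stage latency ranges in milliseconds, taken from the paragraph above.
# The stage names are illustrative labels, not framework identifiers.
STAGE_BUDGET_MS = {
    "network_and_vad": (100, 100),
    "stt_finalization": (150, 300),
    "llm_first_token": (200, 400),
    "tts_first_audio": (75, 200),
}
TARGET_MS = 800


def total_range(budget):
    """Return (best_case, worst_case) end-to-end latency in ms."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


best, worst = total_range(STAGE_BUDGET_MS)
print(f"best case: {best} ms, worst case: {worst} ms, target: {TARGET_MS} ms")
print("worst case fits budget:", worst <= TARGET_MS)
```

With these ranges the best case is 525 ms and the worst case is 1000 ms, so every stage has to run near the fast end of its range to stay under 800ms.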
The most common cause of a laggy voice agent is not a slow model; it is an 800ms VAD silence timeout sitting in front of an otherwise fast pipeline. Tune turn detection first, then optimize models. And never put a batch-mode transcriber like base Whisper in the hot path; use it only as an offline fallback.
Choosing your STT, LLM, and TTS components
For most voice AI agents in 2026 the right stack is Deepgram for STT in the hot path, a small fast LLM like GPT-4.1 mini or Claude Haiku for turn responses, and Cartesia Sonic or ElevenLabs Flash for TTS. Each layer is chosen for streaming latency first, quality second.
On STT, Deepgram Nova-3 streams partial transcripts with roughly 200-400ms finalization and posts word error rates far below batch Whisper on clean audio (benchmarks cite 1.53% versus 6.82%). Deepgram’s newer Flux model is purpose-built for voice agents, with model-integrated end-of-turn detection so you do not have to bolt VAD on separately. OpenAI Whisper, by contrast, runs 1-3 seconds for a 10-second clip in batch mode, which is fine for offline transcription and useless in the live loop. Keep Whisper as a cheap async fallback, not your primary.
On TTS, the latency leaders are close. Cartesia Sonic-3 streams first audio in roughly 40-90ms thanks to its state-space-model architecture, which generates audio tokens in parallel rather than autoregressively. ElevenLabs Flash v2.5 lands around 75ms of pure inference and roughly 150ms end-to-end including network, with a deeper voice library. Pick Cartesia when raw latency wins; pick ElevenLabs when voice selection or cloning matters. On the LLM, resist the urge to use a frontier model for every turn: a fast mini model handles 90% of turns under your time-to-first-token budget, and you can route only complex turns to a larger model.




| Layer | Recommended | Approx. latency | Notes |
|---|---|---|---|
| STT | Deepgram Nova-3 / Flux | 200-400ms finalize | Flux adds native end-of-turn detection |
| STT fallback | OpenAI Whisper (batch) | 1-3s per clip | Offline transcripts only, not live |
| LLM | GPT-4.1 mini / Claude Haiku | 200-400ms TTFT | Route hard turns to a larger model |
| TTS (latency) | Cartesia Sonic-3 | 40-90ms TTFA | State-space model, parallel generation |
| TTS (voices) | ElevenLabs Flash v2.5 | ~150ms TTFA | Larger voice library, cloning |
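Routing only complex turns to a larger model can be as simple as a heuristic on the final transcript. The sketch below is one possible approach under loud assumptions: the keyword list, the word-count threshold, and the `LARGE_MODEL` placeholder are all hypothetical, and a production router would more likely use a classifier or the fast model's own judgment.

```python
FAST_MODEL = "openai/gpt-4.1-mini"  # the fast model named in the table above
LARGE_MODEL = "larger-model"        # placeholder, not a real model identifier

# Hypothetical heuristics: phrases and length that suggest a harder turn.
COMPLEX_HINTS = ("compare", "explain why", "step by step", "calculate")
MAX_FAST_WORDS = 40


def pick_model(transcript: str) -> str:
    """Choose a model for this turn based on a rough complexity guess."""
    text = transcript.lower()
    too_long = len(text.split()) > MAX_FAST_WORDS
    looks_complex = any(hint in text for hint in COMPLEX_HINTS)
    return LARGE_MODEL if (too_long or looks_complex) else FAST_MODEL


print(pick_model("What time do you open?"))
print(pick_model("Can you compare the two plans for me?"))
```

The check runs on text only and costs microseconds, so it does not eat into the time-to-first-token budget.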
Building the minimal loop in Python
The fastest way to a working voice AI agent is the LiveKit Agents framework, where an AgentSession wires STT, LLM, TTS, VAD, and turn detection in about a dozen lines. LiveKit Agents reached 1.0 in April 2025 and is on the 1.5.x line as of 2026, with adaptive interruption handling and native Model Context Protocol (MCP) tool support. Below is the canonical minimal agent, adapted from LiveKit’s February 2026 tutorial.
Note what the framework is doing for you: Silero VAD detects speech start and stop, a multilingual turn-detection model decides when the user has actually finished their turn, and the session automatically stops the agent’s speech when the user interrupts. You are configuring a pipeline, not writing an audio event loop by hand. The recommended stack here runs around $0.05 per minute of conversation.
API surfaces in this space change fast. The plugin string format, model names, and version pins below are accurate to early 2026 but will drift; always check the framework’s current docs before pinning a production version.
```python
from dotenv import load_dotenv

from livekit import agents
from livekit.agents import AgentServer, AgentSession, Agent
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

load_dotenv(".env.local")


class Assistant(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a helpful voice assistant. Keep responses "
                "concise, ideally under 2 sentences. Be friendly."
            )
        )


server = AgentServer()


@server.rtc_session(agent_name="my-agent")
async def my_agent(ctx: agents.JobContext):
    session = AgentSession(
        stt="deepgram/nova-3:multi",
        llm="openai/gpt-4.1-mini",
        tts="cartesia/sonic-3",
        vad=silero.VAD.load(),               # ~85-100ms VAD
        turn_detection=MultilingualModel(),  # semantic end-of-turn
    )
    await session.start(room=ctx.room, agent=Assistant())
    await session.generate_reply(
        instructions="Greet the user and offer your assistance."
    )


if __name__ == "__main__":
    agents.cli.run_app(server)
```
Step 1: Scaffold the project
Use uv for a fast, reproducible environment: run `uv init --bare`, then `uv add "livekit-agents[silero,turn-detector]~=1.4" python-dotenv`. The extras pull in the Silero VAD and the turn-detection model so you do not manage those dependencies by hand.
Step 2: Download model files
Run `uv run agent.py download-files`. This fetches the local Silero VAD weights and the turn-detector model so the first run does not stall on a download mid-call.
Step 3: Add credentials
Create a `.env.local` with your LiveKit URL, API key, and API secret, plus the provider keys (DEEPGRAM_API_KEY, OPENAI_API_KEY, CARTESIA_API_KEY). The string model identifiers in the code route through LiveKit Inference, so you can let LiveKit broker the provider calls; alternatively, supply your own provider keys directly.
Step 4: Talk to it locally
Run `uv run agent.py console` to test entirely in your terminal with your laptop mic and speakers, no browser or phone needed. When it feels right, switch to `uv run agent.py dev` to connect a real room and a frontend client.
Turn detection and barge-in: making it feel human
Barge-in — letting the caller interrupt the agent mid-sentence — is the difference between a demo and a product, and you have under 200ms to detect speech, kill the TTS stream, and hand the floor back. The hard requirement, per practitioner guides, is to keep your turn-detection layer active even while the agent is playing audio, then cancel the current TTS stream and any in-flight LLM generation the instant real user speech is detected. Each sub-step (detect, stop audio, cancel LLM, yield) gets roughly 30-50ms.
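The cancel-on-interrupt behavior can be sketched with plain `asyncio`, independent of any framework. Everything here is a stand-in: `speak` simulates streaming playback (and, by extension, the in-flight LLM generation feeding it), and the `asyncio.Event` plays the role of the turn detector reporting real user speech. Frameworks such as LiveKit Agents and Pipecat handle this for you; the sketch only shows the shape of the logic.

```python
import asyncio


async def speak(reply_chunks, played):
    """Stand-in for streaming TTS playback: 'plays' one chunk every 50 ms."""
    for chunk in reply_chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)


async def run_turn(reply_chunks, user_speech_started):
    """Play the reply, but cancel playback as soon as the user starts speaking."""
    played = []
    playback = asyncio.create_task(speak(reply_chunks, played))
    barge_in = asyncio.create_task(user_speech_started.wait())

    # Detection stays active while the agent is speaking.
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_in in done and not playback.done()

    # Cancel whichever task is still running, then wait for cleanup.
    for task in (playback, barge_in):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return played, interrupted


async def demo():
    user_speech_started = asyncio.Event()
    # Simulate the caller interrupting 120 ms into the reply.
    asyncio.get_running_loop().call_later(0.12, user_speech_started.set)
    chunks = ["Sure,", "your", "order", "ships", "on", "Friday."]
    return await run_turn(chunks, user_speech_started)


played, interrupted = asyncio.run(demo())
print("played before interruption:", played)
print("interrupted:", interrupted)
```

In a real agent the same cancellation would also be applied to the LLM generation task, so no further tokens are produced for a reply the caller has already talked over.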
There are three turn-detection strategies and the choice drives both latency and false interruptions. VAD-only is the simplest but adds latency proportional to your silence threshold, so it over-waits. STT endpointing uses the transcriber’s own end-of-utterance signal and is the best default for most production agents. Model-based detection reads the partial transcript and predicts completion from meaning, so it can fire before trailing silence even occurs — the lowest-latency option, but it needs tuning to avoid cutting people off mid-thought.
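To see why VAD-only detection over-waits, the sketch below simulates a detector that declares end of turn after a fixed run of silent frames. The 20 ms frame size and the boolean speech flags are assumptions for illustration; a real VAD such as Silero emits speech probabilities rather than clean booleans.

```python
FRAME_MS = 20  # assumed audio frame duration


def vad_only_end_of_turn(frames, silence_timeout_ms):
    """Return the time (ms) at which a VAD-only detector declares end of turn.

    `frames` is a list of booleans, True where the frame contains speech.
    Returns None if the silence timeout is never reached.
    """
    needed = silence_timeout_ms // FRAME_MS
    silent_run = 0
    heard_speech = False
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return (index + 1) * FRAME_MS
    return None


# One second of speech followed by one second of silence.
frames = [True] * 50 + [False] * 50
speech_end_ms = 50 * FRAME_MS

for timeout in (800, 300):
    fired_at = vad_only_end_of_turn(frames, timeout)
    print(f"timeout={timeout} ms -> added wait {fired_at - speech_end_ms} ms")
```

The added wait equals the silence threshold exactly, which is the latency that STT endpointing and model-based detection try to avoid.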
Two practical gotchas. First, client-side acoustic echo cancellation is mandatory, or the agent’s own voice leaking into the mic will trigger phantom barge-ins. Second, watch the cost of aggressive endpointing: Deepgram notes that turning on eager end-of-turn with Flux can cut latency but increase LLM usage 50-70% because you fire more speculative generations.
Start with STT endpointing as your default. Only move to model-based turn detection if measured latency is still too high after the easy wins. And always test barge-in on a real phone call, not just your laptop, because cellular jitter and echo behave nothing like a clean WebRTC room.
“Barge-in is not a feature you add at the end. It is the single behavior that decides whether your agent feels like a conversation or a voicemail menu.”
Surya Koritala, founder of Cyntr
Adding telephony with Twilio
To put your voice AI agent on a real phone number, you bridge Twilio Media Streams to your Python pipeline over a WebSocket — Twilio handles the PSTN call, your server handles the audio frames. Pipecat is the framework most teams reach for here because it models the call as a stream of frames (AudioFrame, TextFrame, UserStartedSpeakingFrame) flowing through a pipeline of processors, which maps cleanly onto Twilio’s media stream. Pipecat is on v1.3.0 as of late May 2026.
The flow is: an inbound call hits a TwiML webhook that tells Twilio to open a Media Stream to your FastAPI WebSocket endpoint. Your server reads the stream’s SID and call SID, builds a TwilioFrameSerializer with your account credentials, and constructs a FastAPIWebsocketTransport. That transport becomes the first and last processor in a Pipeline: transport-in, STT, LLM, TTS, transport-out. The one detail people miss is the sample rate — phone audio is 8kHz, so set audio_in_sample_rate and audio_out_sample_rate to 8000 in your PipelineParams or the agent will sound garbled.
Twilio’s own blog also documents a path using OpenAI’s Realtime API directly over the media stream if you want speech-to-speech instead of a cascade. Pipecat Cloud bills its built-in SIP at $0.005/min and PSTN at $0.018/min if you do not want to manage Twilio yourself.
```python
# Pipecat + Twilio: serializer and sample-rate config
import os

from pipecat.pipeline.task import PipelineParams
from pipecat.serializers.twilio import TwilioFrameSerializer

# stream_id and call_id are parsed from Twilio's first WebSocket messages.
serializer = TwilioFrameSerializer(
    stream_sid=stream_id,
    call_sid=call_id,
    account_sid=os.getenv("TWILIO_ACCOUNT_SID"),
    auth_token=os.getenv("TWILIO_AUTH_TOKEN"),
)

params = PipelineParams(
    audio_in_sample_rate=8000,   # phone audio is 8kHz
    audio_out_sample_rate=8000,  # match it or audio is garbled
    enable_metrics=True,
)

# Pipeline order: transport_in -> stt -> llm -> tts -> transport_out
```
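The two pieces around this config, the TwiML response and the SID extraction, can be sketched with the standard library alone. This is a minimal illustration, not Pipecat's own helper code: the function names are hypothetical, the URL and identifiers are made up, and it assumes Twilio's documented `<Connect><Stream>` TwiML and the `start` event that carries `streamSid` and `callSid`.

```python
import json
from xml.sax.saxutils import quoteattr


def build_twiml(websocket_url: str) -> str:
    """Return TwiML asking Twilio to open a Media Stream to `websocket_url`."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response><Connect>"
        f"<Stream url={quoteattr(websocket_url)} />"
        "</Connect></Response>"
    )


def extract_sids(raw_message: str):
    """Return (stream_sid, call_sid) from a Twilio 'start' message, else None."""
    message = json.loads(raw_message)
    if message.get("event") != "start":
        return None
    start = message.get("start", {})
    return start.get("streamSid"), start.get("callSid")


# Example with made-up identifiers.
print(build_twiml("wss://example.com/ws"))
sample = json.dumps(
    {"event": "start", "start": {"streamSid": "MZ123", "callSid": "CA456"}}
)
print(extract_sids(sample))
```

In a FastAPI server, `build_twiml` would back the `/twiml` route and `extract_sids` would run on each incoming WebSocket text message until it returns a pair to pass into `TwilioFrameSerializer`.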
Step 1: Point Twilio at your webhook
In the Twilio console, set your phone number’s incoming-call webhook to your server’s `/twiml` endpoint. That endpoint returns TwiML that tells Twilio to open a Media Stream to your WebSocket endpoint.
Step 2: Accept the WebSocket and read the SIDs
On WebSocket connect, parse the first Twilio messages to extract the stream SID and call SID. Pass both into the TwilioFrameSerializer so outbound audio is tagged to the right call.
Step 3: Expose locally with a tunnel
For development, run an ngrok or cloudflared tunnel so Twilio can reach your local FastAPI server, then call the number and verify the round-trip on a real handset, including barge-in over cellular.
Build vs buy: orchestration framework decision
The build-vs-buy line for a voice AI agent comes down to monthly minutes and engineering appetite: managed platforms like Vapi win below a few thousand minutes, while self-hosting on Pipecat or LiveKit pays back as volume grows. The trap is reading the headline orchestration fee and ignoring the pass-through model costs.
Vapi charges roughly $0.05/min for orchestration on a bring-your-own-key model, but real all-in costs land at $0.13-0.33/min once you add STT (~$0.01), LLM ($0.02-0.20), TTS (~$0.04), and telephony (~$0.01). Pure orchestrators are cheaper at the platform layer: LiveKit Cloud Agents and Pipecat Cloud both bill about $0.01/min and pass model costs through at vendor cost, with free tiers around 1,000 minutes/month. Self-hosting removes the platform fee entirely but adds engineering and ops time; analyses put the break-even for a CPU-only self-hosted stack in the low thousands to tens of thousands of minutes per month, assuming you have a capable engineer and keep the stack simple.
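The all-in figure is just the sum of the pass-through components, which makes it easy to recompute for your own stack. The sketch below uses the per-minute numbers quoted above; the assumption that a pure orchestrator passes through the same model and telephony costs is mine, made only to keep the comparison simple.

```python
# Per-minute component costs in USD as (low, high), from the paragraph above.
VAPI_COMPONENTS = {
    "orchestration": (0.05, 0.05),
    "stt": (0.01, 0.01),
    "llm": (0.02, 0.20),
    "tts": (0.04, 0.04),
    "telephony": (0.01, 0.01),
}
# Assumption: same pass-through costs, only the platform fee differs.
ORCHESTRATOR_COMPONENTS = dict(VAPI_COMPONENTS, orchestration=(0.01, 0.01))


def per_minute_range(components):
    """Return the (low, high) all-in cost per minute, rounded to cents."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return round(low, 2), round(high, 2)


def monthly_cost(per_minute_usd, minutes):
    """Return the monthly bill in USD for a given minute volume."""
    return round(per_minute_usd * minutes, 2)


for name, components in (
    ("managed", VAPI_COMPONENTS),
    ("orchestrator", ORCHESTRATOR_COMPONENTS),
):
    low, high = per_minute_range(components)
    print(f"{name}: ${low:.2f}-${high:.2f}/min, "
          f"${monthly_cost(low, 10_000):,.0f}-${monthly_cost(high, 10_000):,.0f} "
          f"at 10,000 min/month")
```

The listed components sum to about $0.13-0.31 per minute, slightly under the quoted upper bound, which is expected when the inputs are themselves rounded. The LLM line dominates the spread, so model choice moves the bill far more than the platform fee does.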
My rule of thumb: prototype on a managed platform to validate the use case in a week, then move the hot path to Pipecat or LiveKit once you can see your monthly minute volume. The frameworks are open source, so you are not locked in, and the cascade architecture means your STT, LLM, and TTS choices port over largely unchanged.
LiveKit Agents
Best for: Teams that want app + telephony, MCP tools, and a clean upgrade path to managed hosting.
Pipecat
Best for: Phone bots and anyone who wants fine-grained control over the frame pipeline.
Vapi
Best for: Solo builders and teams validating a use case before committing engineering.
Production checklist and the verdict
Start cascaded, budget for 800ms, treat barge-in as a feature
Before a voice AI agent touches real callers, instrument latency per stage, test barge-in on real phone lines, and add an offline transcript fallback so you can debug what the agent actually heard. The gap between a demo that wows in a quiet room and a product that survives cellular jitter, background noise, and impatient humans is almost entirely operational discipline.
Concretely: enable per-stage metrics (Pipecat and LiveKit both expose them) and alert when any stage exceeds its budget. Run a batch transcriber like Whisper asynchronously on recorded audio so you have a ground-truth transcript independent of the hot-path STT. Build an evaluation harness that replays recorded calls and scores latency, interruption handling, and task success — do not eyeball it. And cap your LLM context tightly; long system prompts inflate time-to-first-token, which is the most visible part of your budget.
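Alerting on per-stage budgets can start as a small offline check over replayed calls. The sketch below is framework-agnostic: the stage names, the sample measurements, and the choice of a nearest-rank 95th percentile are assumptions, while the targets are the component figures quoted earlier in this article.

```python
import math

# Component targets quoted earlier in this article, in milliseconds.
STAGE_TARGETS_MS = {"stt": 200, "llm_ttft": 300, "tts_ttfa": 300}


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def stages_over_budget(measurements, targets=STAGE_TARGETS_MS):
    """Return {stage: p95_ms} for every stage whose p95 exceeds its target."""
    return {
        stage: p95(values)
        for stage, values in measurements.items()
        if stage in targets and p95(values) > targets[stage]
    }


# Made-up measurements from five replayed calls.
measurements = {
    "stt": [150, 160, 170, 180, 190],
    "llm_ttft": [220, 250, 280, 310, 450],
    "tts_ttfa": [80, 90, 95, 100, 120],
}
print(stages_over_budget(measurements))
```

Using a high percentile rather than the mean matters here because callers notice the slow turns, not the average one.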
The fast-moving caveat bears repeating: model names, plugin signatures, and pricing in this space shift on a monthly cadence. Treat the specific code above as a starting template, pin your dependency versions, and re-check the framework docs before every production deploy.
Builder’s take
I build agent orchestration runtimes for a living at Cyntr, and voice is the hardest modality to get right because every millisecond is audible. The frameworks have matured enormously since 2024, but the failure modes are still mostly about latency discipline and turn-taking, not model quality.
- Treat the 800ms round-trip as a hard budget, not an aspiration. If you cannot name where every 100ms goes, your agent will feel laggy and users will talk over it.
- Start with a managed orchestrator (LiveKit, Pipecat) and a cascaded STT-LLM-TTS stack before you reach for speech-native models. Cascade gives you per-layer swappability, which you will need.
- Barge-in is a first-class feature, not a nice-to-have. Keep turn detection running while the agent speaks, and budget under 200ms to kill TTS when the user interrupts.
- Build versus buy hinges on minutes per month and whether you have an engineer who enjoys streaming audio. Below a few thousand minutes, Vapi-style platforms win on time-to-market; above that, self-hosting on Pipecat or LiveKit pays back.
Frequently asked questions
Which STT should I use for a voice AI agent?
Deepgram Nova-3 or its agent-tuned Flux model is the default for the hot path, with roughly 200-400ms streaming finalization and word error rates far below batch Whisper on clean audio. Whisper is best kept as an offline, async fallback for ground-truth transcripts because its 1-3 second batch latency is too slow for a live conversation.
What latency should I target?
Aim for under 800ms from the end of the user’s speech to the first audio of the reply, with sub-500ms considered the gold standard. Component-level targets are STT under 200ms, LLM time-to-first-token under 300ms, and TTS time-to-first-audio under 300ms. The biggest hidden cost is end-of-turn detection, so tune that before optimizing models.
Should I choose LiveKit Agents or Pipecat?
Both are excellent open-source Python frameworks. LiveKit Agents wires the whole pipeline in about a dozen lines and has strong WebRTC plus a managed cloud, making it great for app-and-phone agents. Pipecat’s frame-based pipeline shines for telephony and custom processor graphs. Choose LiveKit for the gentlest path to a working agent, Pipecat for fine-grained control over a phone bot.
How do I connect the agent to a phone number?
Bridge Twilio Media Streams to your Python pipeline over a WebSocket. An inbound call hits a TwiML webhook that opens a media stream to your FastAPI WebSocket; you build a TwilioFrameSerializer with your account credentials and run a Pipecat pipeline at 8kHz sample rate. Pipecat Cloud also offers built-in SIP at $0.005/min and PSTN at $0.018/min if you prefer not to manage Twilio directly.
What is barge-in and why does it matter?
Barge-in is letting the caller interrupt the agent mid-sentence. It matters because it is the single behavior that separates a natural conversation from a rigid voicemail menu. You have under 200ms to detect the user’s speech, cancel the TTS stream and in-flight LLM generation, and hand the floor back. It requires keeping turn detection active during playback and using client-side echo cancellation to avoid phantom interruptions.
Should I build or buy?
It depends on monthly minutes. Managed platforms like Vapi cost roughly $0.05/min for orchestration but $0.13-0.33/min all-in, and win below a few thousand minutes for speed-to-market. Self-hosting on Pipecat or LiveKit removes the platform fee (pure orchestrators bill about $0.01/min) and pays back as volume grows, with break-even often in the low thousands to tens of thousands of minutes per month if you have a capable engineer.
Primary sources
- Build Your First AI Voice Agent in Python — LiveKit
- Pipecat open-source framework (GitHub) — Pipecat / Daily
- Turn Detection for Voice Agents: VAD, Endpointing, Model-Based — LiveKit
- Deepgram vs Whisper in 2026 — OpenTypeless
- TTS Latency Benchmark 2026: TTFA Compared — Gradium
- Twilio WebSocket Integration — Pipecat
- Self-Hosted Voice Agents vs Vapi: Real Cost Analysis — Dograh
- 12 Voice Agent Platforms Compared — Softcery
- Deepgram Pricing — Deepgram
Last updated: May 31, 2026.