Why Latency Is Everything in Voice AI
In a text-based chatbot, 2 seconds of latency is acceptable. In a voice conversation, it's a dealbreaker. Humans perceive conversational pauses longer than 300ms as awkward. Anything above 500ms and the caller starts wondering if the line dropped.
Our inference target for voice AI is sub-200ms end-to-end, measured from the moment the caller finishes speaking to the moment they hear the agent's first word. That 200ms budget covers:
- ASR (speech-to-text): ~40-60ms
- LLM inference (the "thinking"): ~60-80ms
- Tool calls (API lookups, CRM queries): ~20-40ms
- TTS (text-to-speech): ~30-50ms
- Network and audio buffering: ~10-20ms
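As a quick sanity check on the budget above, the per-stage ranges can be added up. This is only illustrative arithmetic on the numbers in the list; the names `BUDGET_MS` and `serial_total` are made up for the example.

```python
# Per-stage budget in milliseconds as (low, high), copied from the list above.
BUDGET_MS = {
    "asr": (40, 60),
    "llm": (60, 80),
    "tool": (20, 40),
    "tts": (30, 50),
    "network": (10, 20),
}


def serial_total(budget, skip=()):
    """Sum the low and high ends of every stage not named in `skip`."""
    low = sum(lo for name, (lo, hi) in budget.items() if name not in skip)
    high = sum(hi for name, (lo, hi) in budget.items() if name not in skip)
    return low, high


print(serial_total(BUDGET_MS))                  # (160, 250): every stage back to back
print(serial_total(BUDGET_MS, skip=("tool",)))  # (140, 210): tool call fully overlapped
```

Run strictly in sequence, the worst case lands above 200ms, which is why overlapping stages matters as much as making each stage fast.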
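Every millisecond matters. Taken back to back, the upper ends of these ranges add up to more than 200ms, so hitting the target depends on overlapping stages as well as speeding up each one. Here's how we do it.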
The Inference Stack
Layer 1: Model Architecture and Quantization
The single biggest lever in inference optimization is model size. A 70B-parameter model, even with aggressive optimization, will never hit 80ms inference on affordable hardware. That's why we start with a 3B-parameter base model and fine-tune it on customer data.
But even a 3B model needs optimization for real-time voice:
- GPTQ 4-bit quantization — Reduces model memory footprint by 75% with less than 1% accuracy loss on domain-specific tasks. A 3B model goes from ~6GB to ~1.5GB, fitting entirely in GPU VRAM.
- KV-cache optimization — We use paged attention (inspired by vLLM) to efficiently manage the key-value cache across concurrent conversations, reducing memory fragmentation.
- Speculative decoding — A tiny draft model (300M params) cheaply proposes several candidate tokens ahead, and the main model verifies all of them in parallel in a single forward pass. Any token the main model disagrees with is rejected, so this gives us 2-3x faster token generation with zero quality loss.
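To make the draft-then-verify idea concrete, here is a toy sketch of the greedy variant of speculative decoding. It is not the production implementation: the "models" are plain Python functions, and `toy_target`, `toy_draft`, `speculative_step` and `generate` are hypothetical names. Real engines verify all drafted positions in one batched forward pass, and sampling-based variants use a probabilistic acceptance rule instead of the exact-match check shown here.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int = 4) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round (greedy)."""
    # 1. The draft model proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposed position. A real engine scores all
    #    k positions in one forward pass; this toy calls it once per position.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # fix the first mismatch, then stop
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the target's own next token is a bonus.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after the prompt using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[len(prompt):len(prompt) + n_tokens]


def toy_target(prefix: List[int]) -> int:
    """Deterministic stand-in for the main model."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    """Stand-in draft model: agrees with the target except on every 5th position."""
    guess = toy_target(prefix)
    return guess if len(prefix) % 5 else (guess + 1) % 50


print(generate(toy_target, toy_draft, [1, 2, 3], 10))
```

Because every emitted token is either confirmed or supplied by the target, the output is identical to decoding with the target alone. The speed-up comes from accepting several tokens per target pass whenever the draft guesses right.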
The result: our fine-tuned 3B TLM generates a full response (~40-60 tokens for a typical voice turn) in 60-80ms on a single consumer GPU.
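The memory figures in the quantization bullet follow from simple arithmetic on weight storage. The sketch below assumes FP16 (16-bit) weights as the baseline and ignores quantization metadata (scales and zero-points), activations, and the KV cache, so real footprints will be somewhat higher. `weight_bytes` is a name made up for the example.

```python
def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Bytes needed to store the weights alone at a given precision."""
    return n_params * bits_per_weight / 8


GB = 1e9
fp16_gb = weight_bytes(3e9, 16) / GB   # 6.0 GB for a 3B model in FP16
int4_gb = weight_bytes(3e9, 4) / GB    # 1.5 GB for the same model at 4 bits
reduction = 1 - int4_gb / fp16_gb      # 0.75, i.e. a 75% smaller footprint

print(fp16_gb, int4_gb, reduction)
```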
Layer 2: Serving Infrastructure
The model is just one piece. The serving layer around it determines whether that raw inference speed translates to actual user-perceived latency.
Our model serving stack:
- Continuous batching — Instead of waiting for a batch to fill, we add new requests to the running batch on every iteration. This keeps GPU utilization above 85% while maintaining sub-100ms queue time.
- Streaming inference — We don't wait for the full response to generate. Tokens stream to the TTS engine as they're produced, so speech synthesis starts while the model is still generating.
- Warm model pools — Customer-specific models are pre-loaded in GPU memory during business hours, so calls never hit a cold start. When a model does need to be swapped, it takes ~200ms, and we schedule the swap during silence in the call.
- Async tool execution — When the model decides to call a tool (e.g., look up an order), the tool call fires asynchronously. If the model is generating a preamble like "Let me check that for you...", the API call runs in parallel.
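The async tool execution pattern above can be sketched with `asyncio`. This is a minimal illustration under assumptions, not the production serving code: `speak_preamble` and `lookup_order` are hypothetical stand-ins that just sleep, and the 50ms and 30ms durations are invented for the example.

```python
import asyncio
import time


async def speak_preamble() -> str:
    """Stand-in for generating and streaming a short filler sentence."""
    await asyncio.sleep(0.05)   # pretend generation takes about 50 ms
    return "Let me check that for you..."


async def lookup_order(order_id: str) -> dict:
    """Stand-in for an order API call against a hypothetical backend."""
    await asyncio.sleep(0.03)   # pretend the API takes about 30 ms
    return {"order_id": order_id, "status": "out for delivery"}


async def handle_turn(order_id: str):
    start = time.perf_counter()
    # Fire the tool call first so it runs while the preamble is produced.
    tool_task = asyncio.create_task(lookup_order(order_id))
    preamble = await speak_preamble()
    order = await tool_task     # usually already finished by this point
    elapsed_ms = (time.perf_counter() - start) * 1000
    return preamble, order, elapsed_ms


if __name__ == "__main__":
    preamble, order, elapsed_ms = asyncio.run(handle_turn("A123"))
    print(preamble, order, f"{elapsed_ms:.0f} ms")
```

Since the two coroutines overlap, the turn takes roughly as long as the slower of the two rather than their sum, which is how a tool call's latency ends up mostly hidden from the caller.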
Layer 3: ASR Optimization
Automatic Speech Recognition is the first bottleneck in voice AI inference. The caller's audio needs to be transcribed to text before the LLM can process it.
Our approach:
- Streaming ASR — We don't wait for the caller to finish their entire sentence. Partial transcripts stream to the LLM in real time, so the model can begin "thinking" before the caller stops speaking.
- Endpointing — A lightweight voice activity detector (VAD) determines when the caller has finished speaking. We tuned this aggressively for Indian conversational patterns — shorter pauses, more filler words, frequent code-switching.
- Fine-tuned Whisper — We run a distilled, fine-tuned Whisper model optimized for Indian English and Hindi. Runs in ~40ms for a typical 3-5 second utterance on a T4 GPU.
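Endpointing comes down to deciding how much trailing silence counts as "done speaking". The sketch below uses a simple energy threshold as a stand-in for the VAD. The frame size, threshold, and silence window are illustrative assumptions, not the tuned production values, and `find_endpoint` is a hypothetical name.

```python
from typing import Iterable, Optional


def find_endpoint(frame_energies: Iterable[float],
                  frame_ms: int = 20,
                  energy_threshold: float = 0.01,
                  min_silence_ms: int = 200) -> Optional[int]:
    """Return the time (ms) at which end of speech is declared, or None."""
    silence_ms = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silence_ms = 0               # any speech resets the silence timer
        elif heard_speech:               # leading silence is ignored
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                return (i + 1) * frame_ms
    return None


# 100 ms lead-in silence, speech, a 60 ms pause, more speech, then silence.
frames = [0.0] * 5 + [0.2] * 10 + [0.0] * 3 + [0.3] * 5 + [0.0] * 15

print(find_endpoint(frames))                      # 660: waits out the short pause
print(find_endpoint(frames, min_silence_ms=50))   # 360: cuts in during the pause
```

The second call shows the trade-off behind aggressive tuning: a shorter silence window responds sooner but risks cutting the caller off mid-sentence, so the window has to match how the callers actually speak.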
Layer 4: TTS Streaming
The last mile: converting the model's text response into natural-sounding speech. Traditional TTS generates the entire audio file, then plays it. We don't have that luxury.
- Chunk-based synthesis — As tokens stream from the LLM, we synthesize audio in small chunks (50-100ms). The first audio chunk plays while subsequent chunks are still being generated.
- Voice cloning per customer — Each deployment has a custom voice profile tuned for the brand. Warm, professional, and natural — not robotic.
- Prosody matching — Our TTS model adjusts tone, pace, and emphasis based on the content. Questions sound like questions. Confirmations sound confident.
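On the text side, chunk-based synthesis needs some rule for when to hand buffered tokens to the synthesizer. One plausible approach, sketched here under assumptions, is to flush at punctuation or after a few words. The sketch treats tokens as whole words for simplicity (real LLM tokens are sub-word pieces), and `phrase_chunks` and `max_words` are hypothetical names. The audio chunk duration itself would be set inside the synthesizer and is not modelled.

```python
from typing import Iterable, Iterator, List


def phrase_chunks(tokens: Iterable[str], max_words: int = 6) -> Iterator[str]:
    """Group streamed text tokens into short phrases for synthesis."""
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        ends_phrase = token.rstrip().endswith((".", ",", "?", "!"))
        if ends_phrase or len(buffer) >= max_words:
            yield " ".join(buffer)   # hand this phrase to the TTS engine now
            buffer = []
    if buffer:                       # flush whatever is left at end of stream
        yield " ".join(buffer)


tokens = "Let me check that for you. Your order is out for delivery today.".split()
chunks = list(phrase_chunks(tokens))
print(chunks)
# ['Let me check that for you.', 'Your order is out for delivery', 'today.']
```

Because `phrase_chunks` is a generator, the first phrase is available for synthesis as soon as its last token arrives, without waiting for the rest of the response.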
The Full Pipeline: A Latency Breakdown
Here's an actual production trace from a logistics client deployment — a typical "where is my order?" call:
| Stage | Operation | Latency |
|---|---|---|
| ASR | Streaming transcription (Hindi + English) | 48ms |
| LLM | Intent classification + response generation | 67ms |
| Tool | Order API lookup (async, overlapped with preamble) | 32ms* |
| TTS | First audio chunk synthesis | 35ms |
| Network | Audio buffering + SIP transport | 12ms |
| Total | End-to-end (first audio byte) | 162ms |
*Tool call runs in parallel with preamble generation, so its latency is mostly hidden.
P95 across all calls: 182ms. P50: 141ms. Faster than most human agents pick up the phone.
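The 162ms total in the trace can be reproduced by summing only the stages on the critical path. The snippet below is plain arithmetic on the table's numbers; the `overlapped` flag is an illustrative way of marking the tool call that runs alongside preamble generation.

```python
TRACE = [
    # (stage, latency_ms, overlapped_with_other_work)
    ("asr", 48, False),
    ("llm", 67, False),
    ("tool", 32, True),    # runs in parallel with the preamble
    ("tts", 35, False),
    ("network", 12, False),
]

critical_path_ms = sum(ms for _, ms, overlapped in TRACE if not overlapped)
all_stages_ms = sum(ms for _, ms, _ in TRACE)

print(critical_path_ms)  # 162: what the caller actually waits for
print(all_stages_ms)     # 194: what it would be with the tool call in series
```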
Cost Optimization: The Other Side of Inference
AI inference cost is the hidden killer of voice AI economics. Here's how the math works at scale:
- A 70B model on an A100 GPU: ~$2.50/hour, serving ~15 concurrent calls
- Our 3B quantized model on a T4 GPU: ~$0.30/hour, serving ~40 concurrent calls
That's a 22x cost difference per conversation. And because our fine-tuned 3B model matches or exceeds the 70B model's accuracy on domain-specific tasks, there's no quality trade-off.
The key insight: inference cost scales with model size, not with model quality on specific tasks. A well-fine-tuned small model is both cheaper AND better than a generic large model for defined use cases.
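The 22x figure comes from dividing each GPU's hourly price by the number of calls it serves at once. The sketch below assumes each GPU is fully loaded at the stated concurrency; `cost_per_call_hour` is a name made up for the example.

```python
def cost_per_call_hour(gpu_cost_per_hour: float, concurrent_calls: int) -> float:
    """GPU cost attributable to one call running for one hour."""
    return gpu_cost_per_hour / concurrent_calls


large = cost_per_call_hour(2.50, 15)   # 70B model on an A100
small = cost_per_call_hour(0.30, 40)   # quantized 3B model on a T4
ratio = large / small

print(f"${large:.4f} vs ${small:.4f} per call-hour, {ratio:.1f}x")
# $0.1667 vs $0.0075 per call-hour, 22.2x
```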
GPU Fleet Management
We run a mixed GPU fleet to optimize cost further:
- T4 GPUs for steady-state inference — best cost/performance ratio for our quantized 3B models
- L4 GPUs for burst capacity — spun up during peak hours, spot instances for non-critical overflow
- CPU inference for lightweight models (VAD, intent classification, entity extraction) — no GPU needed for these
We auto-scale on call queue depth, not GPU utilization. If calls are waiting, we scale up. If GPUs are idle, we scale down. Response time is 30 seconds from scale trigger to first inference.
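A queue-depth scaling rule can be as simple as sizing the fleet to the number of calls that need serving. The function below is a hedged sketch, not the production autoscaler: `desired_gpus` and the `min_gpus`/`max_gpus` bounds are hypothetical, and the default of 40 calls per GPU is borrowed from the T4 figure quoted earlier.

```python
import math


def desired_gpus(active_calls: int, queued_calls: int,
                 calls_per_gpu: int = 40,
                 min_gpus: int = 1, max_gpus: int = 50) -> int:
    """Size the fleet from call demand rather than from GPU utilization."""
    demand = active_calls + queued_calls
    needed = math.ceil(demand / calls_per_gpu)
    # Clamp so the fleet never drops to zero or grows without bound.
    return max(min_gpus, min(max_gpus, needed))


print(desired_gpus(active_calls=70, queued_calls=15))   # 3: waiting calls trigger scale-up
print(desired_gpus(active_calls=20, queued_calls=0))    # 1: idle capacity is released
```

Scaling on demand in calls rather than on utilization reacts to the thing callers feel (waiting in a queue), which a utilization metric can lag behind.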
Edge Inference: The Next Frontier
Cloud inference works for most use cases, but some scenarios demand edge AI inference:
- Ultra-low-latency environments — Factory floors where even 20ms of network latency is too much
- Air-gapped deployments — Government and defense clients with no cloud connectivity
- Bandwidth-constrained locations — Rural healthcare centers with intermittent connectivity
- Data sovereignty — Scenarios where raw audio cannot leave the premises
We're actively deploying our quantized models on NVIDIA Jetson Orin devices for edge inference. The 3B model runs at ~120ms inference on Jetson Orin NX — fast enough for real-time voice, entirely offline.
What We've Learned
After running inference at scale for over a year, our key takeaways:
- Quantization is almost a free lunch — 4-bit quantization on fine-tuned models loses less than 1% accuracy on domain tasks. There's rarely a reason not to do it.
- Streaming everything — Don't wait for any stage to complete before starting the next. Pipeline parallelism is the single biggest latency win.
- Fine-tuning beats scale — A 3B model with 6 months of customer data beats a 70B model with zero customer data on the domain-specific tasks it was tuned for.
- Latency budgets are real — Set a p95 target and measure every stage. You can't optimize what you don't measure.
- Inference cost is the real moat — Anyone can fine-tune a model. Running it at scale, profitably, at low latency — that's the hard part.
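Measuring every stage against a p95 target can start with very little code. The sketch below uses the nearest-rank percentile definition and flags stages whose p95 exceeds a budget. The sample latencies are invented for illustration, the budgets are the upper ends of the per-stage ranges quoted at the top of the post, and all names are hypothetical.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def stages_over_budget(stage_samples: Dict[str, List[float]],
                       budgets_ms: Dict[str, float],
                       p: float = 95) -> Dict[str, float]:
    """Return {stage: p-th percentile} for every stage that exceeds its budget."""
    over = {}
    for stage, samples in stage_samples.items():
        value = percentile(samples, p)
        if value > budgets_ms[stage]:
            over[stage] = value
    return over


BUDGETS_MS = {"asr": 60, "llm": 80, "tts": 50}

# Made-up per-call latencies in milliseconds, ten calls per stage.
SAMPLES_MS = {
    "asr": [44, 48, 51, 47, 58, 49, 46, 50, 45, 52],
    "llm": [62, 67, 70, 65, 91, 66, 64, 69, 63, 68],
    "tts": [33, 35, 31, 38, 36, 34, 41, 32, 37, 35],
}

print(stages_over_budget(SAMPLES_MS, BUDGETS_MS))   # {'llm': 91}
```

A single slow LLM call is enough to push that stage's p95 over budget here, which an average over the same samples would hide.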
"The goal isn't to run the biggest model. It's to run the right model, fast enough and cheap enough that every business can afford it."