Model Inferencing at Scale: How TinyAI Achieves Sub-200ms Latency

May 10, 2026 · 14 min read · Engineering · Model Inference

Why Latency Is Everything in Voice AI

In a text-based chatbot, 2 seconds of latency is acceptable. In a voice conversation, it's a dealbreaker. Humans perceive conversational pauses longer than 300ms as awkward. Anything above 500ms and the caller starts wondering if the line dropped.

Our target for model inferencing in voice AI: sub-200ms end-to-end, measured from the moment the caller finishes speaking to the moment they hear the agent's first word. That 200ms budget covers:

ASR (speech-to-text): ~40-60ms
LLM inference (the "thinking"): ~60-80ms
Tool calls (API lookups, CRM queries): ~20-40ms
TTS (text-to-speech): ~30-50ms
Network and audio buffering: ~10-20ms
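
As a quick sanity check on the budget above, here is a minimal sketch that sums the per-stage ranges. The dictionary keys and the function name are illustrative assumptions, not part of our codebase. Run strictly in sequence, the stages add up to 160-250ms, so a 200ms target only holds when stages stay near the low end of their range or overlap with each other.

```python
# Per-stage latency targets in milliseconds (low, high), copied from the list above.
BUDGET_MS = {
    "asr": (40, 60),
    "llm": (60, 80),
    "tools": (20, 40),
    "tts": (30, 50),
    "network": (10, 20),
}
TARGET_MS = 200


def budget_range(stages):
    """Return (best_case, worst_case) if every stage ran strictly in sequence."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = budget_range(BUDGET_MS)
print(f"sequential range: {low}-{high}ms, target: {TARGET_MS}ms")
```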

Every millisecond matters. Here's how we hit our targets.

The Inference Stack

Layer 1: Model Architecture and Quantization

The single biggest lever in AI model inference optimization is model size. A 70B parameter model, even with aggressive optimization, will never hit 80ms inference on affordable hardware. That's why we start with a 3B parameter base model and fine-tune it on customer data.

But even a 3B model needs optimization for real-time voice:

GPTQ 4-bit quantization — Reduces the memory footprint of the model weights by 75% with less than 1% accuracy loss on domain-specific tasks. A 3B model goes from ~6GB at 16-bit precision to ~1.5GB, fitting entirely in GPU VRAM.
KV-cache optimization — We use paged attention (inspired by vLLM) to efficiently manage the key-value cache across concurrent conversations, reducing memory fragmentation.
Speculative decoding — A tiny draft model (300M params) proposes several candidate tokens ahead, which the main model then verifies in parallel in a single forward pass. This gives us 2-3x faster token generation with zero quality loss, because every token we emit is still checked by the main model.
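
To make the speculative decoding idea concrete, here is a toy sketch of the greedy variant. It is not our production decoder: the two "models" are hypothetical stand-in functions over integer tokens, and the verification loop calls the target once per position, whereas a real system scores all proposed positions in one batched forward pass. What it does show is the accept-or-correct logic, and that the output is identical to decoding with the target model alone.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens, one after another.
        proposal, ctx = [], list(out)
        for _ in range(k):
            token = draft(ctx)
            proposal.append(token)
            ctx.append(token)

        # 2. The target model checks each proposed position.
        #    (A real system does this in a single batched forward pass.)
        accepted, ctx = [], list(out)
        for token in proposal:
            expected = target(ctx)
            if expected != token:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(token)
            ctx.append(token)
        else:
            accepted.append(target(ctx))  # all k accepted: one extra token for free

        out.extend(accepted)
    return out[:len(prompt) + max_new]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


# Hypothetical toy models: the draft agrees with the target most of the time.
def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 50


fast = speculative_decode(toy_target, toy_draft, [1, 2, 3], max_new=20)
slow = greedy_decode(toy_target, [1, 2, 3], max_new=20)
print(fast == slow)
```

The speedup in practice comes from how often the draft's guesses are accepted; the toy above only demonstrates that correctness does not depend on the draft being right.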

The result: our fine-tuned 3B TLM generates a full response (~40-60 tokens for a typical voice turn) in 60-80ms on a single T4 GPU.
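
The memory figures quoted for 4-bit quantization earlier in this layer follow from simple arithmetic. The helper below is an illustrative sketch with a made-up name; it counts weight storage only and ignores quantization metadata such as scales and zero-points, which add a small overhead in practice.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


fp16_gb = weight_memory_gb(3, 16)  # 3B parameters at 16 bits each
int4_gb = weight_memory_gb(3, 4)   # the same model at 4 bits each
reduction = 1 - int4_gb / fp16_gb

print(f"{fp16_gb:.1f} GB -> {int4_gb:.1f} GB ({reduction:.0%} smaller)")
```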

Layer 2: Serving Infrastructure

The model is just one piece. The serving layer around it determines whether that raw inference speed translates to actual user-perceived latency.

Our model serving stack:

Continuous batching — Instead of waiting for a batch to fill, we add new requests to the running batch on every iteration. This keeps GPU utilization above 85% while maintaining sub-100ms queue time.
Streaming inference — We don't wait for the full response to generate. Tokens stream to the TTS engine as they're produced, so speech synthesis starts while the model is still generating.
Warm model pools — Customer-specific models are pre-loaded in GPU memory during business hours. No cold starts. When a swap is needed, it takes ~200ms, and we schedule it during silent periods in the call.
Async tool execution — When the model decides to call a tool (e.g., look up an order), the tool call fires asynchronously. If the model is generating a preamble like "Let me check that for you...", the API call runs in parallel.
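
The async tool execution pattern from the list above can be sketched with Python's asyncio. Everything named here (lookup_order, speak, handle_turn, the order ID) is hypothetical, and the sleeps stand in for real network and synthesis time. The point is the ordering: the lookup is started before the preamble is spoken, so the two overlap.

```python
import asyncio


async def lookup_order(order_id: str) -> dict:
    """Hypothetical order API; the sleep stands in for network latency."""
    await asyncio.sleep(0.03)
    return {"order_id": order_id, "status": "out for delivery"}


async def speak(text: str, spoken: list) -> None:
    """Hypothetical TTS hand-off; the sleep stands in for synthesis time."""
    await asyncio.sleep(0.03)
    spoken.append(text)


async def handle_turn(order_id: str) -> list:
    spoken: list = []
    # Start the tool call first so it runs while the preamble is being spoken.
    lookup = asyncio.create_task(lookup_order(order_id))
    await speak("Let me check that for you...", spoken)
    result = await lookup  # typically already finished at this point
    await speak(f"Your order is {result['status']}.", spoken)
    return spoken


if __name__ == "__main__":
    print(asyncio.run(handle_turn("A123")))
```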

Layer 3: ASR Optimization

Automatic Speech Recognition is the first bottleneck in voice AI inference. The caller's audio needs to be transcribed to text before the LLM can process it.

Our approach:

Streaming ASR — We don't wait for the caller to finish their entire sentence. Partial transcripts stream to the LLM in real time, so the model can begin "thinking" before the caller stops speaking.
Endpointing — A lightweight voice activity detector (VAD) determines when the caller has finished speaking. We tuned this aggressively for Indian conversational patterns — shorter pauses, more filler words, frequent code-switching.
Fine-tuned Whisper — We run a distilled, fine-tuned Whisper model optimized for Indian English and Hindi. Runs in ~40ms for a typical 3-5 second utterance on a T4 GPU.
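
To illustrate what endpointing decides, here is a deliberately simple energy-threshold sketch. A production VAD is usually a small trained model rather than a fixed threshold, and the frame size, threshold, and silence window below are assumed values for illustration only. Lowering min_silence_ms is what "aggressive" tuning means in this picture: the agent responds sooner, at the risk of cutting in during a mid-sentence pause.

```python
from typing import Iterable, Optional


def find_endpoint(frame_energies: Iterable[float],
                  frame_ms: int = 20,
                  energy_threshold: float = 0.1,
                  min_silence_ms: int = 200) -> Optional[int]:
    """Return the time (ms) at which end-of-speech is declared, or None.

    The endpoint fires once speech has been heard and is then followed by
    at least min_silence_ms of consecutive low-energy frames.
    """
    heard_speech = False
    silence_ms = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silence_ms = 0  # any speech frame resets the silence counter
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                return (i + 1) * frame_ms
    return None


# Leading silence, speech, a short pause, more speech, then trailing silence.
frames = [0.0] * 5 + [0.5] * 10 + [0.0] * 3 + [0.5] * 5 + [0.0] * 12
print(find_endpoint(frames))                     # waits through the short pause
print(find_endpoint(frames, min_silence_ms=60))  # fires on the short pause
```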

Layer 4: TTS Streaming

The last mile: converting the model's text response into natural-sounding speech. Traditional TTS generates the entire audio file, then plays it. We don't have that luxury.

Chunk-based synthesis — As tokens stream from the LLM, we synthesize audio in small chunks (50-100ms). The first audio chunk plays while subsequent chunks are still being generated.
Voice cloning per customer — Each deployment has a custom voice profile tuned for the brand. Warm, professional, and natural — not robotic.
Prosody matching — Our TTS model adjusts tone, pace, and emphasis based on the content. Questions sound like questions. Confirmations sound confident.
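
Chunk-based synthesis depends on handing the synthesizer text before the LLM has finished. The sketch below shows only the text side of that hand-off, under assumptions of our own choosing for illustration: tokens arrive as strings, and a chunk is flushed at clause punctuation or once it passes a character limit. The function name and the limit are hypothetical, and the 50-100ms audio chunking itself happens inside the TTS engine and is not modeled here.

```python
from typing import Iterable, Iterator


def phrase_chunks(tokens: Iterable[str], max_chars: int = 40) -> Iterator[str]:
    """Group streamed text tokens into speakable chunks.

    A chunk is flushed at sentence or clause punctuation, or once the buffer
    reaches max_chars, so synthesis can start before the response is complete.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        text = buffer.strip()
        if text and (text.endswith((".", "?", "!", ",")) or len(buffer) >= max_chars):
            yield text
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


stream = ["Let", " me", " check", " that", " for", " you", ".",
          " Your", " order", " ships", " today", "."]
for chunk in phrase_chunks(stream):
    print(chunk)
```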

The Full Pipeline: A Latency Breakdown

Here's an actual production trace from a logistics client deployment — a typical "where is my order?" call:

Stage	Operation	Latency
ASR	Streaming transcription (Hindi + English)	48ms
LLM	Intent classification + response generation	67ms
Tool	Order API lookup (async, overlapped with preamble)	32ms*
TTS	First audio chunk synthesis	35ms
Network	Audio buffering + SIP transport	12ms
Total	End-to-end (first audio byte)	162ms

*The tool call runs in parallel with preamble generation, so its latency is hidden and is not counted in the end-to-end total.
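
The total in the trace can be reproduced by summing only the stages the caller actually waits on. This is a small illustrative sketch; the tuple layout and function names are our own choice for the example, and the numbers are copied from the table above.

```python
# (stage, latency_ms, overlapped) copied from the trace table above.
TRACE = [
    ("ASR", 48, False),
    ("LLM", 67, False),
    ("Tool", 32, True),   # runs in parallel with preamble generation
    ("TTS", 35, False),
    ("Network", 12, False),
]


def critical_path_ms(trace):
    """Sum only the stages that are not hidden behind another stage."""
    return sum(ms for _, ms, overlapped in trace if not overlapped)


def sequential_ms(trace):
    """What the total would be if nothing overlapped."""
    return sum(ms for _, ms, _ in trace)


print(critical_path_ms(TRACE), sequential_ms(TRACE))
```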

P95 across all calls: 182ms. P50: 141ms. Faster than most human agents pick up the phone.
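
For readers less familiar with percentile metrics, here is one common way to compute them, the nearest-rank method. This is a generic sketch, not a description of our metrics pipeline, and the sample latencies are made-up round numbers for illustration rather than production data.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented per-call latencies in milliseconds.
latencies_ms = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
```

P95 is the more useful target for voice because it describes the experience of the slowest calls, which are the ones callers notice.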

Cost Optimization: The Other Side of Inference

AI inference cost is the hidden killer of voice AI economics. Here's how the math works at scale:

A 70B model on an A100 GPU: ~$2.50/hour, serving ~15 concurrent calls
Our 3B quantized model on a T4 GPU: ~$0.30/hour, serving ~40 concurrent calls

That's a 22x cost difference per conversation. And because our fine-tuned 3B model matches or exceeds the 70B model's accuracy on domain-specific tasks, there's no quality trade-off.
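
The 22x figure comes from dividing each GPU's hourly price by the number of calls it serves at once. A minimal sketch of that arithmetic, using the prices and concurrency numbers quoted above (the function name is just for illustration):

```python
def cost_per_call_hour(gpu_cost_per_hour: float, concurrent_calls: int) -> float:
    """GPU cost attributable to one call that lasts one hour."""
    return gpu_cost_per_hour / concurrent_calls


large = cost_per_call_hour(2.50, 15)  # 70B model on an A100
small = cost_per_call_hour(0.30, 40)  # quantized 3B model on a T4
ratio = large / small

print(f"${large:.4f} vs ${small:.4f} per call-hour, ratio {ratio:.1f}x")
```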

The key insight: model inferencing cost scales with model size, not with model quality on specific tasks. A well-fine-tuned small model is both cheaper AND better than a generic large model for defined use cases.

GPU Fleet Management

We run a mixed GPU fleet to optimize cost further:

T4 GPUs for steady-state inference — best cost/performance ratio for our quantized 3B models
L4 GPUs for burst capacity — spun up during peak hours, spot instances for non-critical overflow
CPU inference for lightweight models (VAD, intent classification, entity extraction) — no GPU needed for these

We auto-scale on call queue depth, not GPU utilization. If calls are waiting, we scale up. If GPUs are idle, we scale down. New capacity serves its first inference 30 seconds after the scale trigger.
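
A queue-depth scaling rule can be as small as the sketch below. This is an assumed, simplified policy rather than our actual autoscaler: the function name, the replica limits, and the idea of removing every idle replica at once are all illustrative, and the default of 40 calls per replica is taken from the T4 concurrency figure quoted earlier.

```python
import math


def desired_replicas(current: int, queued_calls: int, idle_replicas: int,
                     calls_per_replica: int = 40,
                     min_replicas: int = 1, max_replicas: int = 50) -> int:
    """Pick a replica count from call queue depth rather than GPU utilization."""
    if queued_calls > 0:
        # Calls are waiting: add enough replicas to absorb the queue.
        target = current + math.ceil(queued_calls / calls_per_replica)
    elif idle_replicas > 0:
        # Nothing is waiting and some replicas are idle: release them.
        target = current - idle_replicas
    else:
        target = current
    return max(min_replicas, min(max_replicas, target))


print(desired_replicas(current=4, queued_calls=85, idle_replicas=0))
print(desired_replicas(current=4, queued_calls=0, idle_replicas=2))
```

A real policy would also add a cooldown between decisions so the fleet does not oscillate when load hovers near a threshold.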

Edge Inference: The Next Frontier

Cloud inference works for most use cases, but some scenarios demand edge AI inference:

Zero-latency environments — Factory floors where even 20ms of network latency is too much
Air-gapped deployments — Government and defense clients with no cloud connectivity
Bandwidth-constrained locations — Rural healthcare centers with intermittent connectivity
Data sovereignty — Scenarios where raw audio cannot leave the premises

We're actively deploying our quantized models on NVIDIA Jetson Orin devices for edge inference. The 3B model runs at ~120ms inference on Jetson Orin NX — fast enough for real-time voice, entirely offline.

What We've Learned

After running model inferencing at scale for over a year, our key takeaways:

Quantization is a free lunch — 4-bit quantization on fine-tuned models loses less than 1% accuracy on domain tasks. There's no reason not to do it.
Stream everything — Don't wait for any stage to complete before starting the next. Overlapping pipeline stages is the biggest latency win after model size.
Fine-tuning beats scale — A 3B model with 6 months of customer data beats a 70B model with zero customer data. Every time, on every metric.
Latency budgets are real — Set a p95 target and measure every stage. You can't optimize what you don't measure.
Inference cost is the real moat — Anyone can fine-tune a model. Running it at scale, profitably, at low latency — that's the hard part.
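
Measuring every stage does not require heavy tooling. Here is a minimal sketch of a per-stage timer built on Python's standard library; the class name, stage names, and budgets are hypothetical, and a production system would export these timings to a metrics backend rather than keep them in a dictionary.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.timings_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.timings_ms[name] = self.timings_ms.get(name, 0.0) + elapsed_ms

    def over_budget(self, budgets_ms):
        """Return the stages whose measured time exceeds their budget."""
        return {name: ms for name, ms in self.timings_ms.items()
                if ms > budgets_ms.get(name, float("inf"))}


timer = StageTimer()
with timer.stage("asr"):
    time.sleep(0.01)  # stand-in for real work
print(timer.over_budget({"asr": 60}))
```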

"The goal isn't to run the biggest model. It's to run the right model, fast enough and cheap enough that every business can afford it."


Want to see our inference stack in action?

Book a demo and hear a tinyAgent handle a real conversation at sub-200ms latency.
