Model Inferencing at Scale: How TinyAI Achieves Sub-200ms Latency

A technical deep dive into the inference stack behind our voice AI agents — from quantization to batching to the last millisecond of TTS.

May 10, 2026 | 14 min read | Engineering | Model Inference

Why Latency Is Everything in Voice AI

In a text-based chatbot, 2 seconds of latency is acceptable. In a voice conversation, it's a dealbreaker. Humans perceive conversational pauses longer than 300ms as awkward. Anything above 500ms and the caller starts wondering if the line dropped.

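Our target for model inferencing in voice AI: sub-200ms end-to-end, measured from the moment the caller finishes speaking to the moment they hear the agent's first word. That 200ms budget covers speech recognition (ASR), LLM inference, synthesis of the first TTS audio chunk, and network transport.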

Every millisecond matters. Here's how we hit our targets.

The Inference Stack

Layer 1: Model Architecture and Quantization

The single biggest lever in AI model inference optimization is model size. A 70B parameter model, even with aggressive optimization, will never hit 80ms inference on affordable hardware. That's why we start with a 3B parameter base model and fine-tune it on customer data.
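To make the size argument concrete, here is a back-of-the-envelope sketch of weight memory alone. The function name is illustrative, and the estimate assumes 2 bytes per parameter for 16-bit weights and half a byte for 4-bit weights; it ignores KV cache, activations, and quantization metadata, so real footprints are somewhat larger.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Compare the two model sizes discussed in the text at two precisions.
    for name, params in [("3B", 3), ("70B", 70)]:
        for bits in (16, 4):
            gb = weight_memory_gb(params, bits)
            print(f"{name} model at {bits}-bit weights: {gb:.1f} GB")
```

Under these assumptions a 70B model needs on the order of 140 GB for 16-bit weights, which already rules out a single consumer GPU, while a 3B model fits comfortably.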

But even a 3B model needs optimization for real-time voice:

The result: our fine-tuned 3B TLM generates a full response (~40-60 tokens for a typical voice turn) in 60-80ms on a single consumer GPU.
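For readers new to quantization, the following is a minimal sketch of symmetric, group-wise 4-bit quantization in plain Python. It is a generic textbook scheme chosen for illustration, not a description of the exact method used for the TLM; the function names and the `group_size` parameter are assumptions, and production toolchains use more careful calibration and packed storage.

```python
def quantize_4bit(weights, group_size=4):
    """Map floats to signed 4-bit codes in [-7, 7], with one scale per group."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # The largest magnitude in the group maps to code 7 (or -7).
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-7, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_4bit(codes, scales, group_size=4):
    """Recover approximate floats from the codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


if __name__ == "__main__":
    weights = [0.12, -0.5, 0.33, 0.7, -1.4, 0.02, 0.9, -0.3]  # made-up values
    codes, scales = quantize_4bit(weights)
    restored = dequantize_4bit(codes, scales)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("codes:", codes)
    print("worst absolute error:", worst)
```

Each weight is stored as a 4-bit code plus a shared scale, so the rounding error per weight is bounded by half the group's scale. Smaller groups give tighter error at the cost of storing more scales.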

Layer 2: Serving Infrastructure

The model is just one piece. The serving layer around it determines whether that raw inference speed translates to actual user-perceived latency.

Our model serving stack:

Layer 3: ASR Optimization

Automatic Speech Recognition is the first bottleneck in voice AI inference. The caller's audio needs to be transcribed to text before the LLM can process it.

Our approach:

Layer 4: TTS Streaming

The last mile: converting the model's text response into natural-sounding speech. Traditional TTS generates the entire audio file, then plays it. We don't have that luxury.
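The difference between batch and streaming synthesis can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: `synthesize` is a hypothetical stand-in for a real TTS engine call, and splitting at punctuation is just one simple chunking policy.

```python
import re
from typing import Iterator, List


def split_into_chunks(text: str) -> List[str]:
    """Split at sentence and clause punctuation so each chunk can be voiced alone."""
    parts = re.split(r"(?<=[.!?,;])\s+", text.strip())
    return [p for p in parts if p]


def synthesize(chunk: str) -> bytes:
    """Hypothetical TTS call; returns placeholder bytes instead of real audio."""
    return chunk.encode("utf-8")


def stream_tts(text: str) -> Iterator[bytes]:
    """Yield audio one chunk at a time so playback can start after the first chunk."""
    for chunk in split_into_chunks(text):
        yield synthesize(chunk)


if __name__ == "__main__":
    reply = "Sure, let me check. Your order ships today!"
    for audio in stream_tts(reply):
        print(len(audio), "bytes ready for playback")
```

Because `stream_tts` is a generator, the caller hears the first chunk while later chunks are still being synthesized, so time to first audio depends on the first chunk rather than the whole response.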

The Full Pipeline: A Latency Breakdown

Here's an actual production trace from a logistics client deployment — a typical "where is my order?" call:

Stage | Operation | Latency
ASR | Streaming transcription (Hindi + English) | 48ms
LLM | Intent classification + response generation | 67ms
Tool | Order API lookup (async, overlapped with preamble) | 32ms*
TTS | First audio chunk synthesis | 35ms
Network | Audio buffering + SIP transport | 12ms
Total | End-to-end (first audio byte) | 162ms

*Tool call runs in parallel with preamble generation, so its latency is mostly hidden.
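The overlap described in the footnote can be sketched with `asyncio`. Everything here is a simulation under assumptions: the function names, the order ID, and the delays are made up (and exaggerated so the effect is easy to measure), and the sleeps stand in for real model and API calls.

```python
import asyncio
import time

PREAMBLE_S = 0.20  # simulated preamble generation time (hypothetical)
LOOKUP_S = 0.15    # simulated order API latency (hypothetical)


async def generate_preamble() -> str:
    await asyncio.sleep(PREAMBLE_S)
    return "Let me check that for you."


async def lookup_order(order_id: str) -> dict:
    await asyncio.sleep(LOOKUP_S)
    return {"order_id": order_id, "status": "out for delivery"}


async def handle_turn(order_id: str):
    # Start the lookup first so it runs while the preamble is being generated.
    lookup_task = asyncio.create_task(lookup_order(order_id))
    preamble = await generate_preamble()
    order = await lookup_task  # usually finished already by this point
    return preamble, order


if __name__ == "__main__":
    start = time.perf_counter()
    preamble, order = asyncio.run(handle_turn("A123"))
    elapsed = time.perf_counter() - start
    print(preamble, "| status:", order["status"])
    print(f"elapsed {elapsed:.2f}s vs {PREAMBLE_S + LOOKUP_S:.2f}s if run back to back")
```

The turn takes roughly as long as the slower of the two operations instead of their sum, which is why the 32ms tool call does not appear in the 162ms total (48 + 67 + 35 + 12).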

P95 across all calls: 182ms. P50: 141ms. Faster than most human agents pick up the phone.
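For reference, percentiles like these can be computed from raw per-call latencies with the nearest-rank method. The sketch below is one common definition, not necessarily the one behind the figures above; the sample values are invented for illustration and `within_budget` is a hypothetical helper.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def within_budget(samples, budget_ms=200, p=95):
    """True if the p-th percentile latency is at or under the budget."""
    return percentile(samples, p) <= budget_ms


if __name__ == "__main__":
    # Invented end-to-end latencies in milliseconds, one per call.
    latencies_ms = [141, 150, 162, 130, 182, 175, 120, 198, 160, 145]
    print("p50:", percentile(latencies_ms, 50))
    print("p95:", percentile(latencies_ms, 95))
    print("within 200ms budget at p95:", within_budget(latencies_ms))
```

Tracking p95 rather than the mean matters for voice: a handful of slow turns is exactly what callers notice, and an average hides them.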

Cost Optimization: The Other Side of Inference

AI inference cost is the hidden killer of voice AI economics. Here's how the math works at scale:

That's a 22x cost difference per conversation. And because our fine-tuned 3B model matches or exceeds the 70B model's accuracy on domain-specific tasks, there's no quality trade-off.

The key insight: model inferencing cost scales with model size, not with model quality on specific tasks. A well-fine-tuned small model is both cheaper AND better than a generic large model for defined use cases.
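One simple way to reason about per-conversation cost is to divide GPU time by the number of calls sharing it. The sketch below is a simplified model under loud assumptions: the function and its parameters are hypothetical, the input numbers are placeholders rather than TinyAI's figures, and it is not intended to reproduce the 22x ratio quoted above.

```python
def cost_per_conversation(gpu_hourly_usd: float,
                          concurrent_calls: int,
                          call_minutes: float) -> float:
    """GPU cost attributed to one call, assuming concurrent calls share the GPU evenly."""
    cost_per_gpu_minute = gpu_hourly_usd / 60
    return cost_per_gpu_minute * call_minutes / concurrent_calls


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    small = cost_per_conversation(gpu_hourly_usd=0.60, concurrent_calls=20, call_minutes=3)
    large = cost_per_conversation(gpu_hourly_usd=6.00, concurrent_calls=4, call_minutes=3)
    print(f"small model: ${small:.4f} per call")
    print(f"large model: ${large:.4f} per call")
    print(f"ratio: {large / small:.0f}x")
```

The model shows where the gap comes from: a larger model needs more expensive hardware and serves fewer concurrent calls per GPU, and the two effects multiply.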

GPU Fleet Management

We run a mixed GPU fleet to optimize cost further:

We auto-scale on call queue depth rather than GPU utilization: if calls are waiting, we scale up, and if GPUs sit idle, we scale down. Scale-up takes 30 seconds from trigger to first inference.
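A queue-driven scaling decision can be sketched as a small pure function. This is an illustration only: the function name, the `calls_per_replica` capacity, and the replica limits are hypothetical, and a real scaler would also need cooldowns so that the 30-second spin-up does not trigger repeated scale-ups.

```python
import math


def desired_replicas(current: int,
                     queued_calls: int,
                     idle_replicas: int,
                     calls_per_replica: int = 8,
                     min_replicas: int = 1,
                     max_replicas: int = 50) -> int:
    """Pick a replica count from queue depth first, idle capacity second."""
    if queued_calls > 0:
        # Calls are waiting: add enough replicas to absorb the queue.
        target = current + math.ceil(queued_calls / calls_per_replica)
    elif idle_replicas > 0:
        # Nothing is waiting and some replicas are idle: release them.
        target = current - idle_replicas
    else:
        target = current
    return max(min_replicas, min(max_replicas, target))


if __name__ == "__main__":
    print(desired_replicas(current=4, queued_calls=10, idle_replicas=0))
    print(desired_replicas(current=4, queued_calls=0, idle_replicas=2))
```

Keying the decision on queue depth means the fleet reacts to callers actually waiting, which GPU utilization alone does not reveal.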

Edge Inference: The Next Frontier

Cloud inference works for most use cases, but some scenarios demand edge AI inference:

We're actively deploying our quantized models on NVIDIA Jetson Orin devices for edge inference. The 3B model runs at ~120ms inference on Jetson Orin NX — fast enough for real-time voice, entirely offline.

What We've Learned

After running model inferencing at scale for over a year, our key takeaways:

  1. Quantization is close to a free lunch: 4-bit quantization on fine-tuned models loses less than 1% accuracy on domain tasks. There's no reason not to do it.
  2. Stream everything: don't wait for any stage to complete before starting the next. Pipelining the stages so they overlap is the single biggest latency win.
  3. Fine-tuning beats scale: for defined use cases, a 3B model with 6 months of customer data beats a 70B model with zero customer data.
  4. Latency budgets are real: set a p95 target and measure every stage. You can't optimize what you don't measure.
  5. Inference cost is the real moat: anyone can fine-tune a model. Running it at scale, profitably, at low latency is the hard part.

"The goal isn't to run the biggest model. It's to run the right model, fast enough and cheap enough that every business can afford it."

Want to see our inference stack in action?

Book a demo and hear a tinyAgent handle a real conversation at sub-200ms latency.