VoicERA — Enabling Digital Inclusion in Every Spoken Language

Why 2 Seconds Is the Threshold

Conversations have a rhythm. In phone calls, the acceptable pause before a response — the one that still feels like a flowing conversation rather than a broken one — is approximately 1.5 to 2.5 seconds. Beyond 3 seconds, callers start saying "Hello? Are you there?" and the interaction collapses.

This is not a VoicERA design choice. It is a limit set by how people perceive pauses in conversation.

So when we set out to build VoicERA, we had a hard constraint: the full pipeline from end of speech to first audio out had to be under 2 seconds. Not as an average. As a p95.

The Pipeline Breakdown

The pipeline has five stages:

1. Call Arrival & Streaming (< 50ms)

Phone call audio is streamed in real time using WebRTC via Pipecat. There is no "record and upload" step. Audio chunks arrive at the gateway as they are spoken.

2. Speech-to-Text (< 100ms to first word)

We use AI4Bharat's IndicWav2Vec or IndicWhisper depending on language and context. The key optimisation here is streaming recognition: we don't wait for the caller to finish speaking. We begin recognising words as they arrive and start processing partial hypotheses.

The first word is recognised in under 100ms. The full utterance is finalised within 50ms of the caller stopping speech.

3. LLM Inference (< 600ms)

The recognised text is sent to the LLM. We use a combination of:

RAG (Retrieval Augmented Generation) for knowledge base queries — a hybrid dense/sparse retriever with sub-200ms retrieval
Streaming token generation — we don't wait for the full response. We begin generating TTS audio as soon as we have the first sentence.

The LLM stage is the biggest source of latency variance. We had to tune context window sizes aggressively and pre-warm inference contexts for common query patterns.
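To make the "first sentence" hand-off concrete, here is a minimal sketch of splitting a streamed token sequence into sentences as they complete. It is an illustration under assumptions, not VoicERA's actual code: the function name `sentences_from_tokens`, the punctuation set, and the sample tokens are all hypothetical.

```python
import re
from typing import Iterable, Iterator

# Assumed sentence terminators: ASCII punctuation plus the danda (U+0964)
# used in several Indic scripts. A production splitter would need more care
# (decimals, abbreviations, language-specific rules).
_SENTENCE_END = re.compile(r"[.!?\u0964]\s*$")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if _SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never received closing punctuation.
    if buffer.strip():
        yield buffer.strip()


# Hypothetical token stream standing in for streamed LLM output.
tokens = ["Your ", "balance ", "is ", "500 ", "rupees. ", "Anything ", "else?"]
for sentence in sentences_from_tokens(tokens):
    print(sentence)
```

Because the generator yields as soon as a terminator appears, a downstream TTS call can start on the first sentence without waiting for the rest of the response.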

4. Text-to-Speech (< 250ms first audio)

IndicTTS from AI4Bharat generates natural speech in the target language. The critical optimisation: we begin TTS on the first sentence while the LLM is still generating the second. The caller hears audio start playing within 250ms of the LLM beginning its response.
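The overlap between generation and synthesis can be sketched as a producer and a consumer sharing a queue. This is a simplified illustration: `generate_sentences` and `synthesise` are hypothetical stand-ins that only sleep, and the timings are made up; no real LLM or IndicTTS call is shown.

```python
import asyncio


async def generate_sentences(queue: asyncio.Queue, events: list) -> None:
    """Stand-in for the LLM: emits sentences one at a time."""
    for sentence in ["First sentence.", "Second sentence."]:
        await asyncio.sleep(0.05)  # simulated generation time (assumed)
        events.append(f"llm:{sentence}")
        await queue.put(sentence)
    await queue.put(None)  # sentinel: no more sentences


async def synthesise(queue: asyncio.Queue, events: list) -> None:
    """Stand-in for TTS: starts on each sentence as soon as it is queued."""
    while (sentence := await queue.get()) is not None:
        events.append(f"tts-start:{sentence}")
        await asyncio.sleep(0.02)  # simulated synthesis time (assumed)
        events.append(f"tts-done:{sentence}")


async def run_turn() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    events: list = []
    # Run both stages concurrently so synthesis overlaps generation.
    await asyncio.gather(
        generate_sentences(queue, events),
        synthesise(queue, events),
    )
    return events


events = asyncio.run(run_turn())
for event in events:
    print(event)
```

In the printed event order, synthesis of the first sentence begins before the second sentence has been generated, which is the property the stage relies on.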

5. Audio Streaming Out (< 50ms added latency)

PCM audio is streamed back to the caller in real time. We use a small jitter buffer to smooth packet delivery without adding perceptible latency.
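As a rough sanity check, the per-stage targets from the five headings above can be added up and compared with the 2-second constraint. This is only back-of-the-envelope arithmetic: the targets measure different things (first word, first audio), so a straight sum is an approximation, and the dictionary keys are labels chosen for this sketch.

```python
# Per-stage targets in milliseconds, copied from the stage headings.
STAGE_BUDGET_MS = {
    "call_arrival_and_streaming": 50,
    "speech_to_text_first_word": 100,
    "llm_inference": 600,
    "text_to_speech_first_audio": 250,
    "audio_streaming_out": 50,
}
TARGET_MS = 2000  # the p95 constraint stated earlier

total_ms = sum(STAGE_BUDGET_MS.values())
headroom_ms = TARGET_MS - total_ms
print(f"sum of stage targets: {total_ms} ms, headroom: {headroom_ms} ms")
```

The stage targets sum to 1050 ms, which leaves 950 ms under the 2-second limit for network hops, queueing, and variance.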

What We Had to Give Up

Getting to under 2 seconds required real tradeoffs:

Context window size — we cap conversation history at 8 turns for latency. Long conversations eventually lose early context.
Model size — we use smaller, faster models. Accuracy is slightly lower than the state-of-the-art research models.
Safety layers — we run content filtering asynchronously, which means a problematic response could reach the caller before being flagged.
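The 8-turn cap on conversation history can be sketched with a bounded deque, which silently drops the oldest turn once the limit is reached. This is an assumed implementation for illustration; the names `MAX_TURNS` and `history` and the turn format are hypothetical.

```python
from collections import deque

MAX_TURNS = 8  # the cap described above

# Each entry is one turn: (caller utterance, assistant reply).
history: deque = deque(maxlen=MAX_TURNS)

# Simulate a 10-turn conversation.
for i in range(1, 11):
    history.append((f"caller utterance {i}", f"assistant reply {i}"))

print(len(history))   # number of turns kept
print(history[0][0])  # oldest turn still in context
```

After ten turns only turns 3 to 10 remain, which is the "long conversations eventually lose early context" tradeoff in miniature.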

The Numbers

On a JOHNAIC-80 server (64 vCPU, 128GB RAM, 80GB GPU):

p50 total latency: 1.4s
p95 total latency: 1.9s
p99 total latency: 2.3s

Across 50 simultaneous calls, these numbers remain stable. The bottleneck is GPU memory bandwidth, not CPU.
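For readers unfamiliar with percentile latencies, here is a minimal sketch of how p50/p95/p99 can be computed from per-call measurements using the nearest-rank method. The sample values are made up for illustration and are not VoicERA measurements; the `percentile` helper is likewise a hypothetical name.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample that is greater than
    or equal to pct percent of all samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative latencies in seconds (synthetic, not measured).
samples = [1.2, 1.3, 1.4, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.4]

mean = sum(samples) / len(samples)
print(f"mean: {mean:.2f}s")
print(f"p50:  {percentile(samples, 50)}s")
print(f"p95:  {percentile(samples, 95)}s")
```

With these synthetic values the mean sits comfortably under 2 seconds while the p95 does not, which is why the constraint was stated as a p95 rather than an average.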

How We Got Voice Response Under 2 Seconds