Voice Agents at Scale — jasonedward.me

Voice is the last frontier of AI interaction. Text has been conquered. Vision is being conquered. But voice—real-time, conversational voice—is still wide open.

Why Voice is Different

Text lets you think. You type, you pause, you revise. Voice doesn’t let you do any of that. The agent has maybe 500ms to understand what you said and start responding. That’s not a prompt engineering problem—that’s a systems problem.

Building Codius

Codius isn’t just a chatbot with a microphone. It’s a voice-first agent that:

Understands context — It remembers what you told it yesterday
Interrupts appropriately — It doesn’t wait for you to finish if it knows the answer
Handles real speech — Accents, background noise, mumbling
Learns from interaction — Each conversation makes it slightly better

The Hard Parts

Latency

Run back to back, real-time transcription, LLM inference, and TTS add up to 3+ seconds if you’re not careful. Users expect a response within 1-2 seconds. We shaved 800ms just by:

Starting TTS before the LLM finishes
Using cached embeddings for context
Streaming tokens instead of waiting for completion
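
To make the first and third items concrete, here is a minimal sketch of one way to start TTS before the LLM finishes: buffer the streamed tokens and hand each sentence to the speech engine as soon as it is complete. This is an illustration, not Codius's actual code. The function name sentence_chunks and the punctuation-based boundary rule are assumptions, and the token stream is a plain list standing in for a real streaming LLM response.

```python
from typing import Iterable, Iterator

# Assumption: a sentence ends at ., ! or ?. Real systems need a smarter
# rule (decimals such as "3.5" or abbreviations can trigger an early flush).
SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush once the buffer ends in sentence punctuation, so TTS can
        # start speaking while the LLM is still generating the rest.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    # Hypothetical token stream standing in for a streaming LLM response.
    stream = ["Sure", ",", " I", " can", " help", ".", " What", " time", "?"]
    for sentence in sentence_chunks(stream):
        print("send to TTS:", sentence)
```

The first sentence reaches the speech engine after six tokens instead of after the whole reply, which is where the latency saving comes from. The trade-off is that a naive boundary rule can split text in the wrong place.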

Context Management

In text, you have the full chat history to work from. In voice, utterances are short and fragmentary, so you need to infer intent from sparse input. We built a memory system that:

Extracts key facts from each call
Weights recent conversations higher
Surfaces relevant context automatically
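
A rough sketch of how recency weighting and surfacing might fit together: score each stored fact by its relevance to the current utterance, multiplied by a weight that decays with age. Everything here is assumed for illustration, including the Fact structure, the 7-day half-life, and the word-overlap score, which stands in for whatever similarity measure (such as embeddings) a real system would use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fact:
    text: str
    age_days: float  # days since the call this fact was extracted from


def recency_weight(age_days: float, half_life_days: float = 7.0) -> float:
    """Exponential decay: the weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def relevance(query: str, fact: Fact) -> float:
    """Jaccard word overlap; a simple stand-in for embedding similarity."""
    q = set(query.lower().split())
    f = set(fact.text.lower().split())
    if not q or not f:
        return 0.0
    return len(q & f) / len(q | f)


def top_facts(query: str, facts: List[Fact], k: int = 2) -> List[Fact]:
    """Return up to k facts ranked by relevance times recency weight."""
    scored = [(relevance(query, f) * recency_weight(f.age_days), f) for f in facts]
    scored = [(s, f) for s, f in scored if s > 0]  # drop irrelevant facts
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in scored[:k]]


if __name__ == "__main__":
    memory = [
        Fact("customer prefers morning delivery", age_days=1),
        Fact("customer prefers evening delivery", age_days=30),
        Fact("billing address updated", age_days=2),
    ]
    for fact in top_facts("when does the customer want delivery", memory):
        print(fact.text)
```

Multiplying the two scores means an old fact can still surface if it is highly relevant, while two equally relevant facts are ordered by how recently they came up.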

Interruption

Users expect natural conversation. That means:

Detecting when they’re done talking (not just silence)
Jumping in when appropriate
Backing off when they want to finish their thought
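
One simple way to go beyond silence detection is to let the partial transcript adjust how long a pause has to be before the agent takes its turn. The sketch below waits longer when the last word suggests an unfinished thought. The word list, the function name end_of_turn, and both thresholds are illustrative assumptions, not tuned values from Codius.

```python
# Assumption: an utterance ending in one of these words is probably unfinished.
TRAILING_WORDS = {"and", "but", "so", "because", "or", "um", "uh", "the", "to"}


def end_of_turn(
    transcript: str,
    silence_ms: int,
    short_pause_ms: int = 400,   # pause needed after a complete-sounding utterance
    long_pause_ms: int = 1200,   # pause needed after an unfinished-sounding one
) -> bool:
    """Decide whether the speaker is done, using both silence and wording."""
    words = transcript.lower().rstrip(" .,").split()
    if not words:
        # Nothing has been said yet, so silence alone is not a finished turn.
        return False
    looks_unfinished = words[-1] in TRAILING_WORDS
    threshold = long_pause_ms if looks_unfinished else short_pause_ms
    return silence_ms >= threshold


if __name__ == "__main__":
    print(end_of_turn("I need to reset my password", 500))
    print(end_of_turn("I need to reset my password and", 500))
```

With the same 500ms pause, the agent jumps in after the first utterance and backs off after the second, because a trailing "and" signals the speaker is mid-thought. A production system would likely replace the word list with a learned model over audio and text.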

What’s Next

The teams building voice agents today will own customer relationships in 5 years. Text-first interfaces are going to feel slow and clunky in retrospect.

But you can’t just bolt a microphone onto a chatbot and call it done. You need to rethink everything from first principles.