Voice Agents at Scale

Lessons from building Codius, an autonomous voice agent that learns and improves over time.

8 min read

Voice is the last frontier of AI interaction. Text has been conquered. Vision is being conquered. But voice—real-time, conversational voice—is still wide open.

Why Voice is Different

Text lets you think. You type, you pause, you revise. Voice doesn’t let you do any of that. The agent has maybe 500ms to understand what you said and start responding. That’s not a prompt engineering problem—that’s a systems problem.
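To see why this is a systems problem, it helps to add up a turn's worth of latency. The numbers below are illustrative assumptions, not measurements from Codius; they just show how quickly a back-to-back pipeline blows past a 1-2 second budget:

```python
# Rough latency budget for one conversational turn.
# All figures are illustrative, not measured from any real system.
SEQUENTIAL_MS = {
    "endpoint_detection": 300,  # trailing silence before we decide the user stopped
    "transcription_final": 200, # finalizing the last ASR segment
    "llm_first_token": 600,     # time to first token from the language model
    "tts_first_audio": 400,     # time to first synthesized audio chunk
}

total = sum(SEQUENTIAL_MS.values())
print(f"naive sequential pipeline: {total} ms before the user hears anything")
# With these numbers, the user waits 1500 ms even before network overhead.
```

The point is that no single stage is the villain; the serialization is. That is why the fixes are architectural (overlapping stages) rather than prompt-level.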

Building Codius

Codius isn’t just a chatbot with a microphone. It’s a voice-first agent that:

The Hard Parts

Latency

Real-time transcription + LLM inference + TTS = 3+ seconds if you’re not careful. Users expect responses in 1-2 seconds. We shaved 800ms just by:
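One standard way to claw back hundreds of milliseconds is to overlap the LLM and TTS stages: instead of waiting for the full reply, start synthesizing as soon as the first complete sentence arrives from the token stream. This is a minimal sketch of that idea with a stand-in token source; the sentence-boundary rule and the `llm_stream` function are assumptions for illustration, not Codius's actual implementation:

```python
import re

def llm_stream():
    # Stand-in for a streaming LLM client; yields tokens as they arrive.
    for token in ["Sure", ",", " I can", " help", ".", " What", " city", "?"]:
        yield token

def sentences(tokens):
    """Regroup a token stream into sentences so TTS can start speaking
    the first complete sentence instead of waiting for the full reply."""
    buf = ""
    for token in tokens:
        buf += token
        # Crude boundary rule: the buffer currently ends in ., !, or ?
        if re.search(r"[.!?]$", buf.strip()):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()  # flush any trailing fragment

for sentence in sentences(llm_stream()):
    print("TTS:", sentence)
# → TTS: Sure, I can help.
# → TTS: What city?
```

With this shape, time-to-first-audio is bounded by the first sentence, not the full generation, which is typically where the largest single saving comes from.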

Context Management

In text, you have chat history. In voice, you need to infer intent from sparse input. We built a memory system that:
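To make the intent-inference problem concrete, here is a toy recency-weighted memory: short facts are stored per utterance and retrieved newest-first by keyword. This is a hypothetical shape, not Codius's memory system; the class and method names are invented for the sketch:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str
    ts: float = field(default_factory=time.time)

class TurnMemory:
    """Toy memory store: keeps utterance-level facts and returns the
    most recent matches first. Illustrative only."""
    def __init__(self) -> None:
        self.items: list[Memory] = []

    def remember(self, text: str) -> None:
        self.items.append(Memory(text))

    def recall(self, keyword: str, k: int = 3) -> list[str]:
        hits = [m for m in self.items if keyword.lower() in m.text.lower()]
        hits.sort(key=lambda m: m.ts, reverse=True)  # newest first
        return [m.text for m in hits[:k]]

mem = TurnMemory()
mem.remember("user prefers morning flights")
mem.remember("user is flying to Denver")
print(mem.recall("flight"))
# → ['user prefers morning flights']
```

A real system would replace keyword matching with embedding similarity and add decay, but the interface — remember per turn, recall per intent — is the part that matters for sparse voice input.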

Interruption

Users expect natural conversation. That means:
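The core of natural turn-taking is barge-in: if the user starts talking while the agent is speaking, playback must stop immediately. A minimal state machine for that, assuming a voice-activity-detection (VAD) signal on the user's mic stream — this is a sketch of the general technique, not Codius's implementation:

```python
from enum import Enum, auto

class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class BargeIn:
    """Minimal barge-in handler: user speech during agent playback
    cancels TTS and hands the floor back. Illustrative design."""
    def __init__(self) -> None:
        self.state = State.LISTENING
        self.playback_cancelled = False

    def agent_starts_speaking(self) -> None:
        self.state = State.SPEAKING
        self.playback_cancelled = False

    def on_vad(self, user_is_speaking: bool) -> None:
        # VAD fires on every frame of the user's mic stream.
        if user_is_speaking and self.state is State.SPEAKING:
            self.playback_cancelled = True  # cut TTS mid-utterance
            self.state = State.LISTENING    # yield the floor

agent = BargeIn()
agent.agent_starts_speaking()
agent.on_vad(user_is_speaking=True)
print(agent.state.name, agent.playback_cancelled)
# → LISTENING True
```

The subtle part in production is distinguishing real barge-in from backchannels ("mm-hmm"), which is why the VAD decision usually needs a short debounce window rather than firing on a single frame.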

What’s Next

The teams building voice agents today will own customer relationships in 5 years. Text-first interfaces are going to feel slow and clunky in retrospect.

But you can’t just bolt a microphone onto a chatbot and call it done. You need to rethink everything from first principles.