In Production · Engineering

Autonomous Voice Agent

2,100+ concurrent AI sales calls at 1.1s latency

Timeline: Oct 2025 — Mar 2026
Role: ML Engineer @ Outlyst
Status: In Production
Primary stack: FastAPI · Retell AI · AsyncIO · PostgreSQL

By the numbers

Headline metrics

54% latency reduction

2,100+ concurrent sessions

25% lead conversion lift

The problem

What this project tackles

A voice agent's quality is dominated by latency. A 2.4-second response feels like a bad cell connection; under 1.2 seconds it feels human enough that the prospect stays on the call. The Outlyst voice agent was averaging 2.4s on warm calls and degrading further as concurrency rose: past 200 simultaneous sessions, response times spiked unpredictably and a fraction of sessions dropped entirely.

The hard part is that Retell AI handles speech recognition and TTS, so the backend only answers structured tool calls, yet the round trip from ASR through inference and back is dominated by what we do in those middle hundreds of milliseconds. The fix called for profiling, not an architecture redesign.

Goal: get average call latency under 1.2s, hold it stable past 2,000 concurrent sessions, and do it without horizontal scaling that would have killed the unit economics.

Approach

System design

I instrumented the FastAPI inference backend with py-spy and asyncio task tracing. The traces showed two bottlenecks the metrics dashboards had missed: a synchronous ORM call on each tool invocation that blocked the event loop, and a connection pool sized for HTTP request bursts rather than for long-lived websocket sessions.
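The first bottleneck can be reproduced in miniature by measuring event-loop lag directly. The sketch below is illustrative and is not the production instrumentation: `measure_loop_lag`, `blocking_tool_call`, `async_tool_call`, and `worst_lag_during` are hypothetical names, and `time.sleep` stands in for a synchronous ORM query.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def measure_loop_lag(interval: float, samples: int) -> float:
    """Return the worst delay (seconds) beyond `interval` between wake-ups."""
    worst = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lag = time.perf_counter() - start - interval
        worst = max(worst, lag)
    return worst


async def blocking_tool_call() -> None:
    # Assumption: stand-in for a synchronous ORM query. It blocks the whole loop.
    time.sleep(0.2)


async def async_tool_call() -> None:
    # Assumption: stand-in for an awaited driver call. It yields while waiting.
    await asyncio.sleep(0.2)


async def worst_lag_during(tool_call: Callable[[], Awaitable[None]]) -> float:
    # Run the lag monitor in the background while one tool call executes.
    monitor = asyncio.create_task(measure_loop_lag(interval=0.01, samples=30))
    await asyncio.sleep(0.05)  # let the monitor start ticking
    await tool_call()
    return await monitor


if __name__ == "__main__":
    print("blocking:", asyncio.run(worst_lag_during(blocking_tool_call)))
    print("async:   ", asyncio.run(worst_lag_during(async_tool_call)))
```

With the blocking stand-in, every other coroutine on the loop stalls for roughly the full 0.2s; with the awaited stand-in the lag stays near zero. A per-request latency dashboard can hide this, because the stall is charged to whichever sessions happen to be waiting, not to the call that caused it.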

Replaced the synchronous ORM with asyncpg for direct PostgreSQL access on the hot path. Restructured the connection pool sizing based on observed concurrent-session distribution rather than peak request rate. Parallelised independent tool calls with asyncio.gather() so a single user turn could query CRM, calendar, and contact-enrichment simultaneously instead of in sequence.
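A minimal sketch of the parallel tool-call pattern, assuming three independent lookups. The function names, return shapes, and the 0.1s simulated latencies are hypothetical placeholders, not the real Outlyst integrations.

```python
import asyncio
import time


async def query_crm(contact_id: str) -> dict:
    await asyncio.sleep(0.1)  # simulated I/O latency (assumption)
    return {"contact_id": contact_id, "stage": "demo_booked"}


async def query_calendar(contact_id: str) -> dict:
    await asyncio.sleep(0.1)  # simulated I/O latency (assumption)
    return {"contact_id": contact_id, "next_free_slot": "tomorrow 10:00"}


async def query_enrichment(contact_id: str) -> dict:
    await asyncio.sleep(0.1)  # simulated I/O latency (assumption)
    return {"contact_id": contact_id, "company_size": 120}


async def handle_turn(contact_id: str) -> dict:
    # The three lookups do not depend on each other, so they run concurrently.
    crm, calendar, enrichment = await asyncio.gather(
        query_crm(contact_id),
        query_calendar(contact_id),
        query_enrichment(contact_id),
    )
    return {"crm": crm, "calendar": calendar, "enrichment": enrichment}


async def timed_turn(contact_id: str) -> tuple[dict, float]:
    # Helper for the demo: returns the result and the wall-clock time taken.
    start = time.perf_counter()
    result = await handle_turn(contact_id)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    result, elapsed = asyncio.run(timed_turn("c-1"))
    print(sorted(result), f"{elapsed:.2f}s")
```

Run in sequence, the three simulated calls would take about 0.3s; gathered, the turn takes about as long as the slowest one. If one lookup is allowed to fail without discarding the others, `asyncio.gather(..., return_exceptions=True)` returns exceptions as values instead of raising the first one.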

Built a lightweight gatekeeper-detection classifier that runs before the main inference loop, so we don't burn LLM tokens on receptionists who'll just transfer the call. Detected gatekeepers route to a callback scheduler instead of a dead-end transfer.
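The routing shape can be sketched as follows. This is only an illustration of "filter before inference": the keyword scorer is a toy stand-in for the real classifier, and the cue phrases, the 0.25 threshold, and the route names are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical cue phrases. A real classifier would be trained, not hand-written.
GATEKEEPER_CUES = (
    "how may i direct your call",
    "who's calling",
    "front desk",
    "reception",
)


def keyword_gatekeeper_score(utterance: str) -> float:
    """Toy stand-in for the classifier: fraction of cue phrases present."""
    text = utterance.lower()
    hits = sum(cue in text for cue in GATEKEEPER_CUES)
    return hits / len(GATEKEEPER_CUES)


@dataclass
class Route:
    target: str  # "callback_scheduler" or "llm_inference" (assumed names)
    score: float


def route_call(
    utterance: str,
    score_fn: Callable[[str], float] = keyword_gatekeeper_score,
    threshold: float = 0.25,
) -> Route:
    # Score the opening utterance before any LLM call is made.
    score = score_fn(utterance)
    target = "callback_scheduler" if score >= threshold else "llm_inference"
    return Route(target=target, score=score)


if __name__ == "__main__":
    print(route_call("Front desk, how may I direct your call?"))
    print(route_call("Hi, this is Dana speaking."))
```

Passing the scorer in as `score_fn` keeps the routing decision independent of how the score is produced, so a trained model can replace the keyword heuristic without touching the call flow.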

Added structured CRM sync via automated extraction pipelines, removing the manual data entry that was costing the team 100+ staff hours per week.

Engineering

Key technical decisions

— asyncpg over SQLAlchemy

SQLAlchemy's async support is real but layered with abstractions that show up in flame graphs. asyncpg is the actual driver, no ORM, and the inference backend doesn't need migrations or relationship modeling at request time — just fast reads and writes against a known schema.

— py-spy over cProfile

py-spy samples without instrumentation, so we could profile the production process under real load without restarting it or distorting timing. cProfile is a deterministic tracer that adds overhead to every function call, so it would have changed the timing it was measuring.

— Pool sizing for sessions, not requests

A websocket call holds a session for minutes, not milliseconds. Sizing the pool for HTTP request volume gave us tens of pooled connections trying to serve thousands of long-lived sessions. Sizing it for the observed in-flight session distribution fixed the contention without adding infrastructure.
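One back-of-the-envelope way to reason about pool size is Little's law, assuming a connection is acquired per query and not held for the whole call. The function name `estimate_pool_size` and every number in the example are hypothetical; real inputs would come from the observed session and query distributions.

```python
import math


def estimate_pool_size(
    concurrent_sessions: int,
    queries_per_session_per_s: float,
    mean_query_s: float,
    headroom: float = 1.5,
) -> int:
    """Estimate pooled connections needed for long-lived sessions."""
    # Little's law: average in-flight queries = arrival rate x time in system.
    arrival_rate = concurrent_sessions * queries_per_session_per_s
    in_flight = arrival_rate * mean_query_s
    # Round first so floating-point noise does not bump the ceiling.
    return max(1, math.ceil(round(in_flight * headroom, 6)))


if __name__ == "__main__":
    # Hypothetical load: 2,100 sessions, 2 queries/s each, 50 ms per query.
    print(estimate_pool_size(2100, 2.0, 0.05))  # -> 315
```

The estimate scales with the number of sessions in flight, not with the HTTP request rate, which is why a pool tuned for request bursts can end up an order of magnitude too small for this workload. The `headroom` factor is a crude allowance for bursts above the average.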

— Gatekeeper classifier before inference

Cheap-and-fast filter beats expensive-and-smart. Detecting 'this is a receptionist, not the prospect' with a small classifier saves ~3-5 minutes of GPU time per gated call. It also routes those calls to a callback scheduler instead of dead-ending.

Results

What it delivers

Mean call latency dropped from 2.4s to 1.1s, a 54% reduction, without horizontal scaling. The system now sustains 2,100+ concurrent stateful websocket sessions without dropping any, where it previously degraded past 200.

Downstream business impact: a 25% lift in lead conversions, 27 qualified leads generated through the gatekeeper-aware routing, and 100+ staff hours per week reclaimed by the automated CRM sync pipeline.

Reflections

What I'd do next

The next 200ms of latency reduction is going to come from the LLM inference itself, not the surrounding plumbing: speculative decoding, smaller fine-tuned models for the specific tool-call patterns, or moving the gatekeeper classifier to a co-located CPU model. The gains available from the plumbing are mostly exhausted.

