Avatar & Voice Agent Leaderboard

The avatar arena and benchmark for real-time AI avatars and voice agents. Compare avatar providers (HeyGen, Tavus, Anam, Avaturn) alongside the LLM and TTS components that power them — ranked on lip-sync, gaze, startup time, latency, and price from real recorded conversations.

Composite latency = startup time + 5 × avatar latency. The per-turn avatar render is paid on every reply, so it is weighted 5× the one-off cold start. Lower is better.
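As a quick illustration of the formula above, the sketch below recomputes the composite for each provider. It assumes that the second and third numeric columns of the table are startup time and avatar latency; that reading is an inference from the arithmetic, not something the page states. The function and variable names are hypothetical.

```python
# Composite latency as defined above: startup time + 5 x per-turn avatar latency.
TURN_WEIGHT = 5  # weight of the per-turn avatar render relative to the one-off cold start


def composite_latency(startup: float, avatar_latency: float,
                      turn_weight: int = TURN_WEIGHT) -> float:
    """Return startup + turn_weight * avatar_latency (lower is better)."""
    return startup + turn_weight * avatar_latency


# (startup time, avatar latency) per provider, copied from the table under the
# ASSUMPTION that these are its second and third numeric columns.
rows = {
    "Avaturn": (961, 3301),
    "HeyGen": (4694, 925),
    "Anam": (2574, 860),
    "Tavus": (3185, 1053),
}

composites = {lab: composite_latency(s, a) for lab, (s, a) in rows.items()}

# Print providers from lowest (best) to highest composite latency.
for lab, value in sorted(composites.items(), key=lambda kv: kv[1]):
    print(f"{lab}: {value}")
```

Under that assumption the results (6874, 8450, 9319, 17466) agree with the first numeric column to within a couple of units. The small gaps may come from rounding or from averaging per session before combining, but that is a guess.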

#	Lab	Composite latency	Startup time	Avatar latency					Latency turns	Sessions	Duration
1	Avaturn	17467±1770	961±277	3301±350	4317±291	71.7%±3.9	99.4%±0.3	0.0916	81	27	23m
2	HeyGen	9317±594	4694±537	925±51	3236±304	88.7%±2.7	99.2%±0.9	0.1266	67	23	21m
3	Anam	6874±901	2574±147	860±178	3537±330	53.2%±15.6	98.9%±2.0	0.196	43	13	12m
4	Tavus	8451±839	3185±637	1053±109	2237±262	42.0%±5.5	97.4%±3.1	0.3371	168	85	1h 24m

Values are averages across sessions; ± shows the 95% confidence interval. Higher is better for percentages; lower is better for latencies.
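The page does not say how the intervals are computed. One common way to get a "mean ± 95% interval" from per-session values is the normal approximation, mean ± 1.96 × s / √n. The sketch below shows that approach on hypothetical sample values; it is an assumption about the method, and `mean_with_ci` is an illustrative name.

```python
import math
import statistics

Z_95 = 1.96  # two-sided 95% critical value of the standard normal distribution


def mean_with_ci(samples, z=Z_95):
    """Return (mean, half_width) using the normal approximation.

    half_width = z * sample standard deviation / sqrt(n).
    With fewer than two samples the interval is undefined (NaN).
    """
    n = len(samples)
    mean = statistics.fmean(samples)
    if n < 2:
        return mean, float("nan")
    half_width = z * statistics.stdev(samples) / math.sqrt(n)
    return mean, half_width


# Hypothetical per-session measurements, not taken from the leaderboard.
session_values = [900.0, 1000.0, 1100.0]
mean, half = mean_with_ci(session_values)
print(f"{mean:.0f}±{half:.0f}")  # prints 1000±113
```

With only a handful of sessions, a Student's t critical value would give a wider and more honest interval than 1.96; the normal approximation is used here only to keep the example short.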

Response latency is end-to-end, measured from the captured session audio, so it is available for every avatar provider. Pipeline latency and LLM/TTS time to first byte (TTFB) are component timings from the speech pipeline and are unavailable for providers that serve audio directly to the client (shown as “—”). ASR is not yet ranked, because Deepgram is used for every session.