The avatar arena and benchmark for real-time AI avatars and voice agents. Compare avatar providers (HeyGen, Tavus, Anam, Avaturn) alongside the LLM and TTS components that power them — ranked on lip-sync, gaze, startup time, latency, and price from real recorded conversations.
Composite latency = startup time + 5 × avatar latency. The per-turn avatar render is paid on every reply, so it is weighted 5× the one-off cold start. Lower is better.
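As a rough check, the formula can be applied to the figures in the table below. This is a minimal sketch: the function name and the `rows` dictionary are illustrative, and reading the second and third numeric columns as startup time and avatar latency is an inference from the formula, not something the page states. Because the table reports rounded averages, a recomputed value can differ from the published one by a unit or two (Anam comes out at 7052 against the listed 7054).

```python
def composite_latency(startup: float, avatar_latency: float, turn_weight: int = 5) -> float:
    """Composite latency = startup time + turn_weight x per-turn avatar latency."""
    return startup + turn_weight * avatar_latency


# (startup time, avatar latency) per provider, copied from the table's averages.
rows = {
    "Tavus": (6483, 1026),
    "Avaturn": (1018, 3427),
    "Anam": (2512, 908),
    "HeyGen": (4733, 896),
}

composites = {name: composite_latency(s, a) for name, (s, a) in rows.items()}

# Lower is better, so sort ascending.
for name, value in sorted(composites.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value}")
```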
| # | Lab | Composite latency | Startup time | Avatar latency | | | | | Latency turns | Sessions | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Tavus | 11613±706 | 6483±264 | 1026±131 | 2030±400 | 52.5%±9.7 | 100.0%±0.0 | 0.59 | 38 | 26 | 23m |
| 2 | Avaturn | 18153±1605 | 1018±302 | 3427±315 | 4165±433 | 73.5%±3.0 | 100.0%±0.0 | 0.0916 | 44 | 24 | 22m |
| 3 | Anam | 7054±1180 | 2512±180 | 908±233 | 3321±219 | 55.9%±16.9 | 100.0%±0.0 | 0.196 | 15 | 8 | 7m |
| 4 | HeyGen | 9213±579 | 4733±556 | 896±32 | 3285±414 | 89.0%±2.7 | 99.8%±0.3 | 0.1266 | 34 | 22 | 20m |
Values are averages across sessions; ± shows the 95% confidence interval. Higher is better for percentages; lower is better for latencies.
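One common way to produce a "mean ± 95% interval" figure from per-session values is the normal approximation, mean ± 1.96 × (sample standard deviation / √n). The sketch below assumes that method; the page does not say how its intervals are computed, and the function name and sample values are made up for illustration.

```python
import math
import statistics


def mean_with_ci95(samples: list[float]) -> tuple[float, float]:
    """Return (mean, half-width) of a normal-approximation 95% confidence interval."""
    if len(samples) < 2:
        raise ValueError("need at least two sessions to estimate an interval")
    mean = statistics.mean(samples)
    # Standard error of the mean, using the sample (n - 1) standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, 1.96 * sem


# Hypothetical per-session latencies, not taken from the benchmark.
mean, half_width = mean_with_ci95([1000, 1100, 900, 1000])
print(f"{mean:.0f}±{half_width:.0f}")
```

With few sessions, as for the provider with 8 sessions, a Student's t multiplier would give a wider and arguably more honest interval than 1.96.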
Response latency is end-to-end, measured from the captured session audio, so it covers every avatar provider. Pipeline latency and LLM/TTS TTFB are component timings from the speech pipeline and are unavailable for providers that serve audio directly to the client (shown as “—”). ASR is not yet ranked (Deepgram is used for every session).