Higgs Audio v3 TTS: Beyond Reading, Toward Real Speech for Voice AI
Higgs Audio v3 TTS is built for voice chat: it speaks, not just reads. It turns model responses into expressive conversational speech across 100+ languages, with zero-shot voice cloning and inline control over emotion, style, prosody, pauses, and sound effects.
Voice AI needs a different kind of text-to-speech. In a live conversation, speech is not just the last step after text generation. It is how the agent answers, reacts, pauses, emphasizes, and carries the turn.
Higgs Audio v3 TTS is built for that setting: beyond reading, toward real speech for voice AI. It keeps the reliability of a production TTS system, but it is designed to speak model responses in the moment, with the timing and expression that make an agent feel conversational.
The model is directly controllable from the text stream. Inline tags can change emotion, style, speed, pitch, pauses, and sound effects mid-utterance, so developers can shape how a response is spoken without leaving the generation flow.
Out of the box, Higgs Audio v3 TTS reaches single-digit word and character error rates (WER/CER) on 100 languages. Across Seed-TTS, CV3, MiniMax-Multilingual, and Higgs-Multilingual, v3 achieves the lowest WER/CER, ahead of both Higgs Audio v2 and a broad comparison set of open and commercial systems.
Multilingual Benchmarks
We evaluate Higgs Audio v3 TTS on public multilingual TTS suites and our internal 111-language Higgs-Multilingual set, covering both common and lower-resource languages.
The table reports macro-averaged WER/CER (↓, ×100). Lower is better; bold cells mark the best result per row. Non-Higgs bests are selected from Fish Audio S2 Pro, Qwen3-TTS-1.7B, VibeVoice-7B, IndexTTS-2, MiMo-Audio-7B-Instruct, MOSS-TTS-v1.5, OmniVoice, ChatterBox, and FireRedTTS-2.
| Benchmark | # langs | Higgs Audio v2 | Higgs Audio v3 | Non-Higgs Best Model |
|---|---|---|---|---|
| Seed-TTS | 2 | 2.10 | **1.11** | 1.21 (OmniVoice) |
| CV3 | 13 | 21.19 | **4.41** | 4.60 (Fish Audio S2 Pro) |
| MiniMax-Multilingual | 32 | 49.86 | **2.74** | 2.98 (OmniVoice) |
| Higgs-Multilingual | 111 | 52.24 | **3.61** | 3.63 (OmniVoice) |
Higgs Audio v3's results have been reproduced by the SGLang-Omni team.
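For readers unfamiliar with the metric, here is a minimal sketch of how a macro-averaged WER/CER score can be computed. It is an illustration, not the evaluation code behind the table: the helper names, the sample transcripts, the choice of which languages are scored by character, and the use of corpus-level error rates within each language are all assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(pairs, unit="word"):
    """WER (unit='word') or CER (unit='char') over (reference, hypothesis) pairs, x100."""
    tokenize = str.split if unit == "word" else list
    errors = total = 0
    for ref, hyp in pairs:
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("reference transcripts are empty")
    return 100.0 * errors / total


def macro_average(pairs_by_language, unit_by_language):
    """Score each language separately, then average with equal weight per language."""
    scores = {
        lang: error_rate(pairs, unit_by_language.get(lang, "word"))
        for lang, pairs in pairs_by_language.items()
    }
    return sum(scores.values()) / len(scores), scores


# Hypothetical ASR transcripts of synthesized audio (illustrative data only).
samples = {
    "en": [("turn on the kitchen light", "turn on kitchen light")],  # 1 of 5 words wrong
    "zh": [("你好世界", "你好世间")],                                  # 1 of 4 characters wrong
}
units = {"en": "word", "zh": "char"}  # assumption: character-level scoring for Chinese

macro, per_language = macro_average(samples, units)
print(per_language)  # {'en': 20.0, 'zh': 25.0}
print(macro)         # 22.5
```

Macro-averaging gives every language the same weight regardless of how many test utterances it has, so a lower-resource language with a high error rate moves the score as much as a high-resource one.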
Conversational Behavior Benchmarks
Emergent TTS evaluates conversational behaviors that are hard to capture with transcript accuracy alone, including emotion, foreign words, paralinguistic cues, complex pronunciation, questions, and syntactic complexity.
The table reports win rate (↑) per category, measured as judge preference versus the baseline. For a fair comparison, every model uses the same reference audio for each prompt, and we run the benchmark text verbatim, with no inline control tags inserted.
| Category | Higgs Audio v3 | Fish Audio S2 Pro | Qwen3-TTS-1.7B | IndexTTS-2 | MOSS-TTS-v1.5 | OmniVoice |
|---|---|---|---|---|---|---|
| Overall ↑ | 53.65% | 43.80% | 38.84% | 31.12% | 43.89% | 40.82% |
| Emotions ↑ | 53.75% | 53.04% | 45.54% | 39.29% | 60.54% | 61.07% |
| Foreign Words ↑ | 48.75% | 33.93% | 24.64% | 5.36% | 35.18% | 28.75% |
| Paralinguistics ↑ | 68.57% | 53.75% | 44.29% | 42.50% | 51.43% | 52.68% |
| Complex Pronunciation ↑ | 25.10% | 18.16% | 30.00% | 12.45% | 11.63% | 13.67% |
| Questions ↑ | 61.43% | 55.00% | 53.39% | 45.89% | 53.21% | 45.00% |
| Syntactic Complexity ↑ | 60.71% | 45.71% | 34.11% | 38.93% | 47.32% | 40.36% |
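As a rough illustration of how a per-category win rate can be derived from pairwise judge decisions, consider the sketch below. The scoring convention (a tie counts as half a win), the pooled "Overall" figure, the function names, and the sample judgments are assumptions made for the example and may differ from the benchmark's actual protocol.

```python
from collections import defaultdict

# Assumed convention: a tie counts as half a win for the candidate model.
OUTCOME_SCORE = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def win_rate(outcomes):
    """Win rate in percent for a list of 'win' / 'tie' / 'loss' judgments."""
    if not outcomes:
        raise ValueError("no judgments to score")
    return 100.0 * sum(OUTCOME_SCORE[o] for o in outcomes) / len(outcomes)


def win_rates_by_category(judgments):
    """Group (category, outcome) pairs and score each category plus a pooled overall."""
    grouped = defaultdict(list)
    for category, outcome in judgments:
        grouped[category].append(outcome)
    rates = {category: win_rate(outcomes) for category, outcomes in grouped.items()}
    # Assumption: "Overall" pools every judgment instead of averaging the categories.
    rates["Overall"] = win_rate([outcome for _, outcome in judgments])
    return rates


# Hypothetical judge decisions for a candidate model versus the baseline.
judgments = [
    ("Emotions", "win"), ("Emotions", "loss"), ("Emotions", "tie"), ("Emotions", "win"),
    ("Questions", "win"), ("Questions", "loss"),
]

rates = win_rates_by_category(judgments)
print(rates["Emotions"])   # 62.5
print(rates["Questions"])  # 50.0
```

Under this convention, a score of 50% means the judge rates the candidate on par with the baseline, and anything above 50% means the candidate is preferred more often than not.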
Try It
The fastest way to hear Higgs Audio v3 TTS is in Boson Workspace. Choose a voice, enter a response, and experiment with emotion, prosody, pauses, and style tags directly in the browser.
When you are ready to build, use the Boson API. The endpoint supports blocking and streaming generation, voice cloning from audio references, and the same inline controls used in the workspace.
For local inference, the model weights are available on Hugging Face. You can serve them with SGLang-Omni using the Higgs TTS cookbook.
Acknowledgments
Contributors: Silin Meng, Ke Bai, Ruskin Raj Manku, Huapeng Zhou, Jonah Mackey, Dongming Shen, Erik Li, Weisu Yin, Yizhi Liu, Xinyu Wang, Alex Chen, Lindsey Allen, Mu Li
Special thanks to the broader Boson team for supporting training and evaluation, and to the SGLang-Omni team for optimizing inference.