
Introducing Version 2 of Higgs Audio Generation

July 22, 2025 · By Boson AI Team

Announcing Version 2 of Higgs Audio Generation, our latest advancement in audio generation technology with enhanced multi-speaker and dialog capabilities. Now open source.

We are open-sourcing Higgs Audio V2, a powerful audio foundation model pretrained on over 10 million hours of audio data and a diverse set of text data. Despite having no post-training or fine-tuning, Higgs Audio V2 excels in expressive audio generation, thanks to its deep language and acoustic understanding. This means that you can now focus on telling the model how you want it to render a conversation, or you can simply trust it to deliver an exceptionally believable performance, all on its own. Let's look at some numbers.
[Figure: Higgs Audio V2 on EmergentTTS-Eval]
On EmergentTTS-Eval, it achieves win rates of 75.7% and 55.7% over "gpt-4o-mini-tts" on the "Emotions" and "Questions" categories, respectively. It also obtains state-of-the-art performance on traditional TTS benchmarks like Seed-TTS Eval and Emotional Speech Dataset (ESD). Moreover, the model demonstrates capabilities rarely seen in previous systems, including automatic prosody adaptation during narration, zero-shot generation of natural multi-speaker dialogues in multiple languages, melodic humming with the cloned voice, and simultaneous generation of speech and background music.

Key Features

Higgs Audio V2 represents a significant leap forward in audio AI capabilities:
  • Multi-speaker conversations can be tricky if the speakers can't match each other's energy and emotions. With Higgs Audio V2 this becomes easy - the conversation feels alive and sounds live.
  • Long form audio generation requires a consistent voice while remaining authentic, engaging and lifelike. Higgs Audio allows for conditioning and prompting to achieve excellent long-form audio. Why not try out an audio version of this blog post?
  • High Fidelity Audio is key for lifelike audio on high quality speakers and headphones. With V2 we upgraded our audio pipeline from 16kHz to 24kHz for even better sound.
  • Resource Efficient inference matters when it comes to running models, be it for hobby projects or for large-scale commercial serving. Our smallest models can run on a Jetson Orin Nano. For the latest 3B Audio Generation V2 model you will need at least an RTX 4090 for efficient inference (see the memory sketch after this list).
  • Leading performance in generating lifelike and emotionally competent voice. On the EmergentTTS-Eval benchmark it achieves a win rate of over 75% against gpt-4o-mini-tts on the "Emotions" category.
  • Open source because we believe that everyone should be able to try it out for free.
  • Trained on over 10M hours for even better audio quality and more lifelike voices. Our model relies on a sophisticated processing and annotation pipeline for automatic training data generation.
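As a rough sanity check of the hardware claim above (a back-of-the-envelope sketch, not an official sizing guide): a 3B-parameter model in bf16 needs about 6 GB for weights alone, which leaves room on a 24 GB RTX 4090 for the KV cache and the audio tokenizer. The layer count, head dimensions and context length below are illustrative assumptions, not the model's actual configuration.

```python
# Back-of-the-envelope VRAM estimate for a 3B-parameter model in bf16.
# All shapes below are illustrative assumptions, not Higgs Audio V2's real config.

def gib(n_bytes: float) -> float:
    return n_bytes / 1024**3

params = 3e9                  # ~3B parameters
weight_bytes = params * 2     # bf16 = 2 bytes per parameter

# Rough KV-cache cost per token: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes.
layers, kv_heads, head_dim = 28, 8, 128     # assumed shapes
kv_per_token = 2 * layers * kv_heads * head_dim * 2
context_tokens = 8192                       # assumed context window

total = weight_bytes + kv_per_token * context_tokens
print(f"weights:  {gib(weight_bytes):.1f} GiB")
print(f"KV cache: {gib(kv_per_token * context_tokens):.2f} GiB at {context_tokens} tokens")
print(f"total:    {gib(total):.1f} GiB -> fits a 24 GiB RTX 4090 with headroom")
```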

Examples

Example 1: Higgs Audio interactive conversation translation with voice cloning.
Example 2: Three-voice conversation generation with voice cloning
Shrek: Donkey, look at this wee metal dragon! Cost me a king's ransom, and all it does is fry tatties.
Donkey: That's a GPU, Shrek! You bought it to train AI for swamp visitor alerts.
Shrek: Aye, and it always says "right now." Useless!
Fiona: Should've asked it to predict when you'll stop wasting gold.
Donkey: GPUs are like onions: layers, tears, and a terrible return on investment.
Shrek: I'll stick to parfaits. Cheaper, tastier, no tensor cores.
Fiona: And no overheating. Unlike someone I know.
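A transcript written as "Name: line", like the one above, has to be mapped to speaker labels before it is handed to the generation model. The sketch below shows one way to do that; the [SPEAKER0]-style tags and the helper itself are illustrative assumptions, not a documented Higgs Audio input format.

```python
# Sketch: turn a "Name: line" dialogue into a speaker-tagged transcript.
# The [SPEAKER0] tag convention is an assumption, not a documented format.

dialogue = """\
Shrek: Donkey, look at this wee metal dragon! Cost me a king's ransom, and all it does is fry tatties.
Donkey: That's a GPU, Shrek! You bought it to train AI for swamp visitor alerts.
Fiona: Should've asked it to predict when you'll stop wasting gold."""

speaker_ids: dict[str, int] = {}
tagged_lines = []
for line in dialogue.splitlines():
    name, text = line.split(":", 1)
    idx = speaker_ids.setdefault(name.strip(), len(speaker_ids))
    tagged_lines.append(f"[SPEAKER{idx}] {text.strip()}")

print("\n".join(tagged_lines))
# [SPEAKER0] Donkey, look at this wee metal dragon! ...
# [SPEAKER1] That's a GPU, Shrek! ...
# [SPEAKER2] Should've asked it to predict when you'll stop wasting gold.
```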

Key Innovations

Let's dive a bit deeper into the architecture of Higgs Audio Generation V2. At its heart lies a Large Language Model that provides the language 'intelligence', paired with careful audio tokenization that avoids overloading the model with an excessive number of tokens. The model relies on the following innovations:
  • An automated annotation pipeline that leverages multiple ASR models, sound-event classification models, and our in-house audio understanding model. Based on the automated pipeline, we obtained 10 million hours of filtered annotated audio data, called AudioVerse.
  • The understanding model itself is finetuned on top of Higgs Audio V1 Understanding. It is adopted in the "understanding variant" in the architecture figure below.
  • A unified audio tokenizer captures both semantic and acoustic features.
  • A Dual FFN (feed forward network) architecture, which enhances the LLM's ability to model both text and acoustic tokens efficiently while sharing information via cross-attention.
[Figure: Higgs Audio V2 architecture]
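To make the Dual FFN idea concrete, here is a minimal PyTorch sketch of a transformer block that shares one attention layer over the interleaved text/audio sequence but routes each position through a modality-specific feed-forward network. The layer sizes and the mask-based routing are illustrative assumptions; the actual Higgs Audio block differs in detail.

```python
import torch
import torch.nn as nn

class DualFFNBlock(nn.Module):
    """Sketch of a dual-FFN transformer block: shared attention, per-modality FFNs."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One FFN per modality; the two do not share parameters.
        self.text_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.audio_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, is_audio: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_audio: (batch, seq) bool mask marking audio positions.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        h = self.norm2(x)
        # Route each position through the FFN that matches its modality.
        ffn_out = torch.where(is_audio.unsqueeze(-1), self.audio_ffn(h), self.text_ffn(h))
        return x + ffn_out

block = DualFFNBlock()
tokens = torch.randn(1, 10, 512)                       # interleaved text + audio hidden states
audio_mask = torch.tensor([[False] * 4 + [True] * 6])  # first 4 positions text, rest audio
print(block(tokens, audio_mask).shape)                 # torch.Size([1, 10, 512])
```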

Benchmarks

We are particularly interested in how natural, humanlike and emotionally competent our model is. For that purpose, we assess its performance on four benchmarks: Seed-TTS Eval, Emotional Speech Dataset (ESD), EmergentTTS-Eval, and Multi-speaker Eval.
Seed-TTS Eval and Emotional Speech Dataset
We prompt Higgs Audio with <ref_text>, <ref_audio>, and <text> tags for zero-shot TTS. We adopt the standard evaluation metrics from Seed-TTS Eval and ESD.
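As an illustration of the zero-shot setup, a single test case might be assembled roughly as below. The exact chat template and tag semantics are defined in our GitHub repository; the string layout here is an assumption for readability, with the reference audio tokens elided.

```python
# Sketch: assembling a zero-shot voice-cloning prompt from the three tag types.
# The exact template lives in the Higgs Audio codebase; this layout is an assumption.

ref_text = "The quick brown fox jumps over the lazy dog."   # transcript of the reference clip
target_text = "Welcome to the Seed-TTS evaluation set."     # text the cloned voice should speak

prompt = (
    f"<ref_text>{ref_text}</ref_text>\n"
    "<ref_audio>...reference audio tokens...</ref_audio>\n"
    f"<text>{target_text}</text>"
)
print(prompt)
```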
Model                      | Seed-TTS Eval WER ↓ | Seed-TTS Eval SIM ↑ | ESD WER ↓ | ESD SIM (emo2vec) ↑
---------------------------|---------------------|---------------------|-----------|--------------------
Cosyvoice2                 | 2.28                | 65.49               | 2.71      | 80.48
Qwen2.5-omni†              | 2.33                | 64.10               | -         | -
ElevenLabs Multilingual V2 | 1.43                | 50.00               | 1.66      | 65.87
HiggsAudio V1              | 2.18                | 66.27               | 1.49      | 82.84
HiggsAudio V2 (base)       | 2.44                | 67.70               | 1.78      | 86.13
EmergentTTS-Eval ("Emotions" and "Questions")
Following the EmergentTTS-Eval Paper, we report the win-rate over "gpt-4o-mini-tts" with the "alloy" voice. Since Higgs Audio V2 is a pretrained model, we report performance by cloning the "belinda" voice.
Model                      | Emotions (%) ↑ | Questions (%) ↑
---------------------------|----------------|----------------
Higgs Audio V2 (base)      | 75.71          | 55.71
gpt-4o-audio-preview†      | 61.64          | 47.85
Hume.AI                    | 61.60          | 43.21
BASELINE: gpt-4o-mini-tts  | 50.00          | 50.00
Qwen 2.5 Omni†             | 41.60          | 51.78
minimax/speech-02-hd       | 40.86          | 47.32
ElevenLabs Multilingual v2 | 30.35          | 39.46
DeepGram Aura-2            | 29.28          | 48.21
Sesame csm-1B              | 15.96          | 31.78
Multi-speaker Eval
We also designed a multi-speaker evaluation benchmark to evaluate the capability of Higgs Audio V2 for multi-speaker dialog generation. The benchmark contains three subsets:
  • two-speaker-conversation: 1000 synthetic dialogues involving two speakers. It contains two reference audio clips to evaluate the model's ability in double voice cloning.
  • small talk: 250 synthetic dialogues characterized by short utterances and a limited number of turns (4–6). It also contains two reference audio clips to test double voice cloning, though the dialogues are shorter and simpler than those in two-speaker-conversation.
  • small talk (no ref): 250 synthetic dialogues, also with short utterances and 4–6 turns. Unlike the other subsets, it does not include reference audio and is designed to evaluate the model's ability to automatically assign appropriate voices to speakers.
We evaluate the word error rate (WER) and the geometric mean of intra-speaker similarity and inter-speaker dissimilarity on these three subsets (a sketch of this metric follows the results table). Besides Higgs Audio V2, we also evaluated MoonCast and nari-labs/dia, two of the most popular open-source models capable of multi-speaker dialog generation. Results are summarized in the following table. We were not able to run nari-labs/dia on our "two-speaker-conversation" subset due to its strict limit on utterance length.
Model                 | two-speaker-conversation WER ↓ | Mean Sim & Dis-sim ↑ | small talk WER ↓ | Mean Sim & Dis-sim ↑ | small talk (no ref) WER ↓ | Mean Sim & Dis-sim ↑
----------------------|--------------------------------|----------------------|------------------|----------------------|---------------------------|---------------------
MoonCast              | 38.77                          | 46.02                | 8.33             | 63.68                | 24.65                     | 53.94
nari-labs/dia         | -                              | -                    | 17.62            | 63.15                | 19.46                     | 61.14
Higgs Audio V2 (base) | 18.88                          | 51.95                | 11.89            | 67.92                | 14.65                     | 55.28
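To make the similarity metric used in the table concrete, the sketch below scores one generated dialogue with cosine similarity over speaker embeddings: intra-speaker similarity compares each generated voice to its reference clip, inter-speaker dissimilarity compares the two generated voices to each other, and the final score is their geometric mean. The embedding model and the exact pairing scheme are assumptions; the evaluation scripts in our GitHub repository define the metric precisely.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def multispeaker_score(ref_a, ref_b, gen_a, gen_b) -> float:
    """Geometric mean of intra-speaker similarity and inter-speaker dissimilarity.

    ref_a/ref_b: speaker embeddings of the two reference clips.
    gen_a/gen_b: speaker embeddings of the generated turns for each speaker.
    """
    intra = 0.5 * (cosine(ref_a, gen_a) + cosine(ref_b, gen_b))  # cloned voices should match
    inter = 1.0 - cosine(gen_a, gen_b)                           # the two voices should differ
    return float(np.sqrt(max(intra, 0.0) * max(inter, 0.0)))

# Toy example with random 192-dim "embeddings".
rng = np.random.default_rng(0)
ref_a, gen_a, ref_b, gen_b = (rng.normal(size=192) for _ in range(4))
print(round(multispeaker_score(ref_a, ref_b, gen_a, gen_b), 3))
```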
For more details see the documentation in our GitHub repository.

Conclusion

BosonAI's powerful Large Audio Language Model for audio generation – Higgs Audio v2 – is now open source. It's the first open-source, large-scale audio model that excels at multi-speaker, lifelike and emotionally competent voice generation. Higgs Audio v2 opens doors for developers, creatives, and researchers to build conversational agents, audiobooks, podcasts, and more with lifelike performance.
The Higgs Audio v2 model is trained on a massive self-annotated corpus of over 10M hours of audio data, using BosonAI's ASR and LLM models. We are releasing the pretrained model as open source. The model adopts an innovative Dual-FFN architecture that handles text and audio tokens jointly. Moreover, the tokenizer has dedicated representations for both the semantic and the acoustic aspects of audio. This allows for a relatively low token rate of 25Hz together with high acoustic fidelity in the generated audio. This combination of model, tokenization and data allows Higgs Audio v2 to generate natural and pleasing emotional speech, dialogue and interaction. We are really pleased that Higgs Audio v2 has achieved state-of-the-art performance, beating "gpt-4o-mini-tts" with a 75.7% win rate on "Emotions" and 55.7% on "Questions" in EmergentTTS-Eval.
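A quick way to see what the 25Hz token rate buys for long-form generation (the 50Hz comparison rate below is just a reference point, not a specific competing system):

```python
# What a 25 Hz audio token rate implies for sequence length (rough arithmetic).

token_rate_hz = 25          # Higgs Audio V2 tokenizer frame rate
minutes = 10                # e.g. a ten-minute podcast segment
seconds = minutes * 60

tokens_25hz = token_rate_hz * seconds
tokens_50hz = 50 * seconds  # a common higher frame rate, for comparison
print(f"{minutes} min of audio -> {tokens_25hz} audio tokens at 25 Hz "
      f"vs {tokens_50hz} at 50 Hz")
# 10 min of audio -> 15000 audio tokens at 25 Hz vs 30000 at 50 Hz
```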
Some of the key features of Higgs Audio v2:
  • Multi-speaker conversations can be tricky if the speakers can't match each other's energy and emotions. With Higgs Audio V2 this becomes easy - the conversation feels alive and sounds live.
  • Long form audio generation requires consistent voice, while simultaneously being authentic, engaging and lifelike. Higgs Audio allows for conditioning and prompting to achieve excellent long-form audio.
  • High Fidelity Audio is key for lifelike audio on high quality speakers and headphones. With V2 we upgraded our audio pipeline from 16kHz to 24kHz for even better sound.
  • Resource Efficient inference matters when it comes to running models, be it for hobby projects or for large scale commercial serving. Our smallest models can run on a Jetson Orin Nano.
Give it a try by cloning our GitHub repository, trying our online demo, or visiting our HuggingFace Space.

Contact us for aligned foundation models or custom AI solutions tailored to your needs.

Acknowledgments
Lead: Alex Smola, Mu Li and Xingjian Shi
Pretrain: Xingjian Shi, Mu Li
Audio Tokenizer: Martin Ma, Ke Bai, Silin Meng, Murdock Aubry, Xingjian Shi
Data: Mu Li, Dongming Shen, Silin Meng
Evaluation: Ruskin Raj Manku, Ke Bai, Yuzhi Tang
Serving & Playground: Zach Zheng, Yizhi Liu
DC Infrastructure: Sergii Tiugaiev
We would like to thank our customers for their constructive feedback and the excellent technical support from our friends at NVIDIA, eStruxture and ARC Compute.
#higgs-audio
#text-to-speech
#voice-generation
#multimodal