Introducing Higgs-Audio - Advanced Audio Understanding and Generation
April 3, 2025 • By Boson AI Team
At Boson AI, we work on making communication with AI as easy, natural and fun as talking to a human. Today, we are excited to introduce Higgs Audio Understanding and Higgs Audio Generation — two powerful tools designed to build customized AI agents tailored for diverse audio understanding and generation needs.
Higgs Audio Generation
To communicate with humans in a delightful and natural manner, we need to be able to generate realistic, emotionally expressive, and properly accentuated speech. We need a system capable of pronouncing words correctly even when they derive from a foreign language, particularly people’s names and places. We need a system that can generate conversations between multiple speakers, for example when multiple characters in a game are involved, or when reading books or screenplays.
Pure TTS (text-to-speech) systems struggle with these tasks: they typically do not understand the meaning of what they are generating, nor do they convey the urgency, hesitation, or other intonations that would be plainly obvious to a human speaker. They also struggle to adopt the natural character of a speaker, e.g. whether they are naturally enthusiastic or more deliberate and thoughtful.
The way to address this problem is to build a TTS system with a Large Language Model (LLM) as its backbone. This endows the TTS system with the understanding needed to generate competent speech. Higgs Audio Generation extends the underlying LLM to process audio by treating raw audio as tokens, which enables the model to be trained end-to-end on extensive text-audio datasets.
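To make this concrete, here is a minimal sketch of the audio-as-tokens idea, assuming a discrete audio codec with a 1,024-entry codebook and a simple offset convention. The post does not disclose Higgs Audio's actual tokenizer, so every name below is illustrative.

```python
import torch

TEXT_VOCAB = 128_000      # assumed size of the base LLM's text vocabulary
AUDIO_CODEBOOK = 1_024    # assumed size of the audio codec's codebook

def audio_to_tokens(waveform: torch.Tensor, codec) -> torch.Tensor:
    """Quantize raw audio into discrete codec codes, then offset them past
    the text vocabulary so one embedding table covers both modalities."""
    codes = codec.encode(waveform)      # hypothetical codec: ints in [0, 1024)
    return codes + TEXT_VOCAB           # audio ids live after the text ids

def tokens_to_audio(tokens: torch.Tensor, codec) -> torch.Tensor:
    """Invert the offset and decode generated audio tokens back to waveform."""
    return codec.decode(tokens - TEXT_VOCAB)

# Training sequences interleave both modalities, so the LLM learns
# text -> audio with the ordinary next-token objective:
# [text tokens for the script] + [audio tokens for the recording]
```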
The base model we are introducing today demonstrates impressive performance on benchmark tests. It also showcases emergent capabilities, including generating speech whose emotional tone follows the text’s semantics and producing multi-speaker dialogues from written transcripts, all thanks to this improved understanding. Before diving into technical details, let’s listen to two examples of audio generated by our model.
A profound sense of realization washed over Beal as he whispered, "You've been there for me all along, haven't you? I never truly appreciated you until now."
Overwhelmed with confusion and despair, David Darlan cried out, "What do you want from me? Why can't you just tell me what's wrong? Leave me alone!"
Audio Generation Benchmarks
Of course, beautiful-sounding audio is only part of the story; we also need to verify the quality objectively. For that purpose, we evaluate Higgs Audio against CosyVoice2, Qwen2.5-omni, and ElevenLabs on two widely used audio generation benchmarks: Seed-TTS Eval and the Emotional Speech Dataset (ESD). In this comparison, each model is given a reference (text, audio) pair and must generate audio for the target text while matching the style of the reference audio. As the table shows, Higgs Audio is meaningfully better at generation than the reference models, including ElevenLabs.
| Model          | Seed-TTS Eval WER ↓ | Seed-TTS Eval SIM ↑ | ESD WER ↓ | ESD SIM ↑ |
|----------------|---------------------|---------------------|-----------|-----------|
| CosyVoice2     | 2.28                | 65.49               | 2.71      | 80.48     |
| Qwen2.5-omni†  | 2.33                | 64.10               | –         | –         |
| ElevenLabs     | 1.43                | 50.00               | 1.66      | 65.87     |
| Higgs Audio    | 2.18                | 66.27               | 1.49      | 82.84     |
† Qwen2.5-omni's performance is from the official report.
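For reference, the two metrics are computed roughly as follows: WER first transcribes the generated clip with an ASR model and measures word-level mismatch against the target text, while SIM measures speaker similarity between the reference and generated audio. A minimal sketch (the specific ASR and speaker-embedding models behind these numbers are not specified in this post):

```python
import jiwer                        # pip install jiwer
import torch
import torch.nn.functional as F

def word_error_rate(target_text: str, asr_transcript: str) -> float:
    """WER (lower is better): run ASR on the generated clip first,
    then compare its transcript against the target text."""
    return jiwer.wer(target_text, asr_transcript)

def speaker_similarity(ref_emb: torch.Tensor, gen_emb: torch.Tensor) -> float:
    """SIM (higher is better): cosine similarity between speaker
    embeddings of the reference audio and the generated audio."""
    return F.cosine_similarity(ref_emb, gen_emb, dim=-1).item()
```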
Audio Generation Judger
Over the past decade, TTS systems have improved dramatically. With this progress, assessing their quality has become increasingly difficult. Traditional metrics such as word error rate (WER) or Mean Opinion Score (MOS) provide only a rough estimate of speech quality. In particular, these metrics fail to capture crucial elements like the naturalness of tone, pitch, energy, pauses, and non-verbal cues such as sighs.
This problem is reminiscent of problems in natural language processing, where mere agreement in the number of characters or words is no longer an accurate measure of quality. For instance, ‘I saw the cat’ is more similar to ‘I saw the kitten’ than to ‘I saw the rat’, even though the latter differs from the first sentence by only a single character.
Drawing inspiration from the concept of LLM-as-a-judge, we leverage advanced audio understanding models to assess the quality of generated audio. Specifically, we selected 120 text prompts from the BASE TTS categories, including “Compound Nouns”, “Emotions”, “Foreign Words”, “Paralinguistics”, “Questions”, and “Syntactic Complexities”. We then use Gemini 2.0 Flash to evaluate whether the generated audio outperforms the industry standard, ElevenLabs. We refer to this evaluation as the EmergentTTS-Eval benchmark. Let’s listen to how this works in practice:
His face lit up with pure delight as he exclaimed, "We did it! We won the championship! I knew we could do it together!"
Higgs Audio (System 1)
ElevenLabs (System 2)
Judger output: Both systems successfully synthesized the text. However, system 1 better captured the emotion of excitement in the phrase "We did it! We won the championship!" by using a more enthusiastic tone and varying the pitch to convey the speaker's delight. System 2, while clear and understandable, sounded less emotionally expressive, making system 1 the winner.
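Under the hood, the judging loop amounts to sending the prompt text plus both clips to the judge model and asking for a verdict. Here is a minimal sketch using the Gemini Python SDK; the file names and prompt wording are illustrative rather than the exact evaluation pipeline:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
judge = genai.GenerativeModel("gemini-2.0-flash")

PROMPT = (
    "You are judging two TTS systems that synthesized the same text.\n"
    "Text: {text}\n"
    "The first clip is System 1, the second is System 2. Compare "
    "pronunciation, prosody, and emotional expressiveness, then "
    "declare a winner with a short justification."
)

def judge_pair(text: str, clip1_path: str, clip2_path: str) -> str:
    """Upload both clips and ask the judge model for a verdict."""
    clip1 = genai.upload_file(clip1_path)   # e.g. the Higgs Audio output
    clip2 = genai.upload_file(clip2_path)   # e.g. the ElevenLabs output
    response = judge.generate_content([PROMPT.format(text=text), clip1, clip2])
    return response.text
```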
We compare the model’s performance against Qwen2.5-omni, ElevenLabs, and GPT-4o-mini-TTS using the EmergentTTS-Eval benchmark.
| Model           | WER ↓ | Win-rate ↑ |
|-----------------|-------|------------|
| Qwen2.5-omni    | 6.74  | 50.83      |
| GPT-4o-mini-TTS | 3.14  | 58.33      |
| ElevenLabs      | 1.31  | 50.00      |
| Higgs Audio     | 1.82  | 61.67      |
This illustrates that the model is capable of producing natural and emotionally expressive speech that aligns with the semantic context. In terms of benchmark numbers, our model performs well relative to the other models in a paired comparison; ElevenLabs sits at a 50% win rate by construction, since we used it as the baseline comparator for our benchmark.
Emergent Capability of Generating Multi-speaker Dialogues
We noticed that Higgs Audio can produce realistic multi-speaker dialogues from a transcript. This ability highlights the model’s strong semantic understanding of the text, enabling it to uncover the underlying story and respond accordingly. Below are some audio examples created by directly feeding the raw transcript (generated by ChatGPT) into the model. You’ll observe that the model successfully role-plays multiple characters and generates natural interruptions and filler words.
SPEAKER 0: You're training Higgs Audio again? Aren't you tired of staring at it all day?
SPEAKER 1: You're training Higgs Audio again? Aren't you tired of staring at it all day?
SPEAKER 0: Oh, so you want it to sound like a real conversation with multiple people? That sounds… tricky.
SPEAKER 1: It is. The biggest challenge is making sure it understands who's speaking and when. We need a solid dataset with real conversations, including interruptions and natural flow.
SPEAKER 0: Right, because real conversations aren't just people taking turns like robots. There are overlaps, hesitations, and sudden topic changes.
SPEAKER 1: Exactly! That's why we need speaker diarization, so the model knows when one speaker stops and another starts, even if they overlap.
Higgs Audio Understanding
Speaking is only half of the story; listening is the other half. To build a competent system for human-machine interaction, such as a sales agent, we need an audio understanding model. Again, this goes beyond mere speech recognition: the emotion, the context, and the background noise all matter for a fuller understanding.
Similar to Higgs Audio Generation, we start with a pretrained LLM. To obtain the understanding model, we feed raw audio into the LLM and train end-to-end on large-scale text-audio understanding datasets. The pretrained base model demonstrates impressive performance across both audio understanding and reasoning benchmarks.
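As a conceptual sketch of what training end-to-end on text-audio understanding data can look like in code: audio features act as a prefix and the text (question plus answer) is supervised with the usual next-token loss. The `llm` interface and tensor shapes below are assumptions, since the post does not publish the training recipe.

```python
import torch
import torch.nn.functional as F

def understanding_step(llm, audio_embeds: torch.Tensor, text_ids: torch.Tensor):
    """One causal-LM training step. `llm.embed` and `llm(inputs_embeds=...)`
    are assumed interfaces; audio_embeds is (batch, frames, dim)."""
    inputs = torch.cat([audio_embeds, llm.embed(text_ids)], dim=1)
    logits = llm(inputs_embeds=inputs)      # (batch, seq, vocab)
    n = text_ids.size(1)
    preds = logits[:, -n:-1, :]             # positions predicting text_ids[:, 1:]
    return F.cross_entropy(preds.reshape(-1, preds.size(-1)),
                           text_ids[:, 1:].reshape(-1))
```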
Understanding Benchmarks
We evaluated Higgs Audio on several audio understanding benchmarks, comparing it to Gemini 2.0 Flash and GPT-4o-audio. Higgs Audio shows strong performance across the board.
Reasoning Benchmarks
Next, we evaluated the model’s audio reasoning capabilities on MMAU. Higgs Audio performs well on sound and speech tasks. It lags behind on music tasks due to limited music coverage in our datasets. Nonetheless, by utilizing the Chain-of-Thought (CoT) capability of the base LLM, its performance on music tasks improves significantly.
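As an illustration only (not the exact prompt used in the evaluation), a CoT prompt for MMAU-style multiple-choice questions about an audio clip might look like this:

```python
# Assumed prompt wording; the post only notes that CoT boosts music scores.
COT_TEMPLATE = (
    "Listen to the clip, then answer the question.\n"
    "Question: {question}\n"
    "Choices: {choices}\n"
    "First, describe what you hear (instruments, tempo, notable events). "
    "Then reason step by step, and only then state the final choice."
)

prompt = COT_TEMPLATE.format(
    question="Which instrument carries the melody?",
    choices="(A) violin (B) trumpet (C) piano (D) flute",
)
```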
To explore more about Higgs Audio, feel free to experiment with the generation playground or engage with the live voice chat demo. If you’re interested in integrating advanced speech recognition, natural voice synthesis, or both into your applications, don’t hesitate to reach out to our sales team.
We would like to thank our customers for their constructive feedback and the excellent technical support from our friends at NVIDIA, Arc Compute, eStruxture, Crusoe, AWS, Scaleway, and MKTech.