Higgs Audio V2: Audio Generation Model

We are pleased to announce Higgs Audio V2: advanced voice synthesis for natural and expressive audio, including promptable multi-speaker dialogue and voice cloning. Now open source.
Traditional text-to-speech systems produce robotic voices that lack emotional depth. They struggle with names and accents, and fail to capture the natural flow of human conversation. The challenge becomes even greater with multi-speaker dialogues, where timing, interruptions, and overlapping speech create the authentic feel of real conversation.
Higgs Audio V2 transforms speech synthesis with emotionally expressive voices and natural multi-speaker dialogue. Building on V1's foundation, our latest model delivers richer emotional range, supports voice cloning, and follows directorial instructions. Best of all, Higgs Audio V2 is now open source under a derivative Llama license. By leveraging a backing large language model (LLM), Higgs Audio V2 understands context to deliver appropriate pauses, emphasis, and emotional nuance at 24kHz high fidelity. Let's listen to an example before we dive into more detail.
Overview
Higgs Audio V2 produces emotionally expressive speech and conversations, whether working with minimal annotation or following detailed instructions:
- Wide range of natural voices and accents
- Multi-speaker dialogue with natural pauses and overlaps
- Voice cloning for authentic character consistency
- Director-level control through prompts and instructions
- Resource-efficient: runs on consumer hardware from the Jetson Orin Nano to the NVIDIA RTX 5090
- Realistic soundscapes alongside speech generation
- Accurate rendering of foreign names, places, and concepts
- 24kHz high fidelity audio output
- Trained on over 10 million hours of audio
- Industry-leading benchmark performance
- Open source under a derivative Llama license
How it works
Our model combines several innovations to achieve its performance. Dedicated tokenizers for acoustic and semantic tokens enable high-quality output that keeps listeners engaged.
Operating at just 25Hz token frequency allows longer-range context within LLM windows. A novel dual-FFN architecture processes both text and audio tokens efficiently, while meticulous data processing ensures high-quality annotation.
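To make the dual-FFN idea concrete, here is a minimal PyTorch sketch of a block that shares self-attention across modalities but routes text and audio tokens through separate feed-forward paths. It is an illustration of the general pattern rather than our production code; the module names, dimensions, and routing details are assumptions chosen for readability.

```python
import torch
import torch.nn as nn


class DualFFNBlock(nn.Module):
    """Sketch of a transformer block with shared self-attention and
    separate feed-forward networks (FFNs) for text and audio tokens.
    Module names and dimensions are assumptions for illustration."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16, d_ff: int = 4096):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One FFN per modality; both operate on the same attention output.
        self.text_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.audio_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, is_audio: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_audio: (batch, seq) boolean mask.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        h = self.norm2(x)
        # Route each position through the FFN matching its modality.
        ffn_out = torch.where(
            is_audio.unsqueeze(-1), self.audio_ffn(h), self.text_ffn(h))
        return x + ffn_out


# At 25 audio tokens per second, an hour of speech is 25 * 3600 = 90,000
# tokens, which fits within a long-context LLM window.
```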
This architecture, combined with a pretrained LLM's intelligence, enables our model to infer accent, gender, age, and mood directly from text—much like a human reader would. Directors can instruct the model to render voices and dialogues with specific emotional qualities, timing, and character traits. Read our technical blog post for more details.
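To show what director-level control can look like in practice, the sketch below pairs directorial instructions with a tagged transcript based on the audiobook example later in this post. The `generate_audio` helper, its parameters, and the bracketed speaker tags are hypothetical placeholders rather than our actual API; see the technical blog post and repository for the real interface.

```python
# Hypothetical helper; the real inference API may differ.
def generate_audio(system_prompt: str, transcript: str, out_path: str) -> None:
    """Placeholder for a call into an audio generation backend."""
    raise NotImplementedError("wire this up to your inference backend")


# Directorial instructions describe the scene and the speakers; the
# transcript carries the words, with bracketed tags marking speaker turns.
system_prompt = (
    "Generate audio for a tense two-speaker scene in an abandoned "
    "observatory. SPEAKER0 is Lina, hushed and nervous. "
    "SPEAKER1 is Jay, calm but increasingly alarmed."
)

transcript = (
    "[SPEAKER0] Jay, do you hear that? It's not just interference... "
    "it's rhythmic, like it's trying to speak.\n"
    "[SPEAKER1] Yeah... I boosted the gain, and now it's pulsing every "
    "4.2 seconds. That's not random."
)

generate_audio(system_prompt, transcript, out_path="observatory_scene.wav")
```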
Performance Benchmarks
We evaluated Higgs Audio V2 against leading audio generation models using standardized benchmarks. Using EmergentTTS-Eval with Gemini 2.5 as an impartial judge, we tested performance on the Emotions and Questions benchmarks against GPT-4o as the baseline. This methodology correlates strongly with human preferences and remains reproducible thanks to its automation.
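For readers who want to run a similar evaluation on their own systems, here is a minimal sketch of pairwise win-rate scoring with an LLM judge. It assumes audio has already been generated for every prompt by both systems, and `judge_prefers_first` is a hypothetical stand-in for a call to the judge model rather than part of any benchmark's official tooling.

```python
import random
from typing import Callable, Iterable, Tuple


def win_rate(
    pairs: Iterable[Tuple[str, str, str]],
    judge_prefers_first: Callable[[str, str, str], bool],
) -> float:
    """Fraction of prompts where the judge prefers the candidate system.

    Each item in `pairs` is (prompt, candidate_audio_path, baseline_audio_path).
    `judge_prefers_first` is a hypothetical callable that asks the judge model
    which of two clips better realizes the prompt, returning True if it
    picks the first clip passed to it.
    """
    wins = total = 0
    for prompt, candidate, baseline in pairs:
        total += 1
        # Randomize presentation order to reduce position bias in the judge.
        if random.random() < 0.5:
            wins += judge_prefers_first(prompt, candidate, baseline)
        else:
            wins += not judge_prefers_first(prompt, baseline, candidate)
    return wins / max(total, 1)
```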
Higgs Audio V2 consistently outperforms all tested models, including Hume and Qwen Omni, in voice quality and emotional expression:

For multi-speaker dialogue generation, a particularly challenging task, we compared against Mooncast, the leading open-source solution. Higgs Audio V2 demonstrates significant improvements in dialogue naturalness and speaker differentiation:

Examples
Experience the quality and emotional range of Higgs Audio V2 through these examples. Each demonstrates a different aspect of our technology, from emotional expression to multi-speaker dialogue with natural interruptions.
Multi-speaker Audiobook
The stars shimmered through the cracked dome of the abandoned observatory. A low hum pulsed beneath the floor — the signal was getting stronger.
[music end]
"Jay, do you hear that? It's not just interference... it's rhythmic, like it's trying to speak."
"Yeah... I boosted the gain, and now it's pulsing every 4.2 seconds. That's not random. Someone — or something — is calling us."
Lina stepped closer to the console, her fingers trembling slightly. The old screen flickered to life, and a single word etched itself across the glass: WELCOME.
Realistic Human Voices with Audio Tags
Expressive Multi-speaker Voice Clone
Multilingual Support (Chinese, Korean, and More)
Key Features
- Voice cloning for natural dialogue
- Multi-speaker dialogue
- Rich emotional range
- 24kHz high-fidelity audio
- Open source
- Multilingual support
- Instructable