Higgs Audio V2: Audio Generation Model

We are pleased to announce Higgs Audio V2: advanced voice synthesis for natural and expressive audio, including promptable multi-speaker dialogue and voice cloning. Now open source.
Traditional text-to-speech systems produce robotic voices that lack emotional depth. They struggle with names and accents, and fail to capture the natural flow of human conversation. The challenge becomes even greater with multi-speaker dialogues, where timing, interruptions, and overlapping speech create the authentic feel of real conversation.
Higgs Audio V2 transforms speech synthesis with emotionally expressive voices and natural multi-speaker dialogue. Building on V1's foundation, our latest model delivers richer emotional range, supports voice cloning, and follows directorial instructions. Best of all, Higgs Audio V2 is now open source under a derivative Llama license. By leveraging a backing large language model (LLM), Higgs Audio V2 understands context to deliver appropriate pauses, emphasis, and emotional nuance at 24kHz high fidelity. Let's listen to an example before we dive into more detail.
Overview
Higgs Audio V2 produces emotionally expressive speech and conversations, whether working with minimal annotation or following detailed instructions:
- Wide range of natural voices and accents
- Multi-speaker dialogue with natural pauses and overlaps
- Voice cloning for authentic character consistency
- Director-level control through prompts and instructions
- Resource-efficient: runs on consumer hardware from the Jetson Orin Nano to the NVIDIA RTX 5090
- Realistic soundscapes alongside speech generation
- Accurate rendering of foreign names, places, and concepts
- 24kHz high fidelity audio output
- Trained on over 10 million hours of audio
- Industry-leading benchmark performance
- Open source under a derivative Llama license
How it works
Our model combines several innovations to achieve its performance. Dedicated tokenizers for acoustic and semantic tokens enable high-quality output that keeps listeners engaged.
Operating at just 25Hz token frequency allows longer-range context within LLM windows. A novel dual-FFN architecture processes both text and audio tokens efficiently, while meticulous data processing ensures high-quality annotation.
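To make the dual-FFN idea concrete, here is a minimal PyTorch sketch of a block that shares self-attention across modalities but routes text and audio tokens through separate feed-forward paths. It is an illustration of the general pattern rather than our production code; the module names, dimensions, and routing details are assumptions chosen for readability.

```python
import torch
import torch.nn as nn


class DualFFNBlock(nn.Module):
    """Sketch of a transformer block with shared self-attention and
    separate feed-forward networks (FFNs) for text and audio tokens.
    Module names and dimensions are assumptions for illustration."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16, d_ff: int = 4096):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One FFN per modality; both operate on the same attention output.
        self.text_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.audio_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, is_audio: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_audio: (batch, seq) boolean mask.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        h = self.norm2(x)
        # Route each position through the FFN matching its modality.
        ffn_out = torch.where(
            is_audio.unsqueeze(-1), self.audio_ffn(h), self.text_ffn(h))
        return x + ffn_out


# At 25 audio tokens per second, an hour of speech is 25 * 3600 = 90,000
# tokens, which fits within a long-context LLM window.
```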
This architecture, combined with a pretrained LLM's intelligence, enables our model to infer accent, gender, age, and mood directly from text—much like a human reader would. Directors can instruct the model to render voices and dialogues with specific emotional qualities, timing, and character traits. Read our technical blog post for more details.
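To show what director-level control can look like in practice, the sketch below pairs directorial instructions with a tagged transcript based on the audiobook example later in this post. The `generate_audio` helper, its parameters, and the bracketed speaker tags are hypothetical placeholders rather than our actual API; see the technical blog post and repository for the real interface.

```python
# Hypothetical helper; the real inference API may differ.
def generate_audio(system_prompt: str, transcript: str, out_path: str) -> None:
    """Placeholder for a call into an audio generation backend."""
    raise NotImplementedError("wire this up to your inference backend")


# Directorial instructions describe the scene and the speakers; the
# transcript carries the words, with bracketed tags marking speaker turns.
system_prompt = (
    "Generate audio for a tense two-speaker scene in an abandoned "
    "observatory. SPEAKER0 is Lina, hushed and nervous. "
    "SPEAKER1 is Jay, calm but increasingly alarmed."
)

transcript = (
    "[SPEAKER0] Jay, do you hear that? It's not just interference... "
    "it's rhythmic, like it's trying to speak.\n"
    "[SPEAKER1] Yeah... I boosted the gain, and now it's pulsing every "
    "4.2 seconds. That's not random."
)

generate_audio(system_prompt, transcript, out_path="observatory_scene.wav")
```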
Performance Benchmarks
We evaluated Higgs Audio V2 against leading audio generation models using standardized benchmarks. Using EmergentTTS-Eval with Gemini 2.5 as an impartial judge, we tested performance on the Emotions and Questions benchmarks against GPT-4o as the baseline. This methodology correlates strongly with human preferences and remains reproducible thanks to its automation.
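For readers who want to run a similar evaluation on their own systems, here is a minimal sketch of pairwise win-rate scoring with an LLM judge. It assumes audio has already been generated for every prompt by both systems, and `judge_prefers_first` is a hypothetical stand-in for a call to the judge model rather than part of any benchmark's official tooling.

```python
import random
from typing import Callable, Iterable, Tuple


def win_rate(
    pairs: Iterable[Tuple[str, str, str]],
    judge_prefers_first: Callable[[str, str, str], bool],
) -> float:
    """Fraction of prompts where the judge prefers the candidate system.

    Each item in `pairs` is (prompt, candidate_audio_path, baseline_audio_path).
    `judge_prefers_first` is a hypothetical callable that asks the judge model
    which of two clips better realizes the prompt, returning True if it
    picks the first clip passed to it.
    """
    wins = total = 0
    for prompt, candidate, baseline in pairs:
        total += 1
        # Randomize presentation order to reduce position bias in the judge.
        if random.random() < 0.5:
            wins += judge_prefers_first(prompt, candidate, baseline)
        else:
            wins += not judge_prefers_first(prompt, baseline, candidate)
    return wins / max(total, 1)
```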
Higgs Audio V2 consistently outperforms all tested models, including Hume and Qwen Omni, in voice quality and emotional expression:

For multi-speaker dialogue generation, a particularly challenging task, we compared against Mooncast, the leading open-source solution. Higgs Audio V2 demonstrates significant improvements in dialogue naturalness and speaker differentiation:

Examples
Experience the quality and emotional range of Higgs Audio V2 through these examples. Each demonstrates a different aspect of our technology, from emotional expression to multi-speaker dialogue with natural interruptions.
Multi-speaker Audiobook
The stars shimmered through the cracked dome of the abandoned observatory. A low hum pulsed beneath the floor — the signal was getting stronger.
[music end]
"Jay, do you hear that? It's not just interference... it's rhythmic, like it's trying to speak."
"Yeah... I boosted the gain, and now it's pulsing every 4.2 seconds. That's not random. Someone — or something — is calling us."
Lina stepped closer to the console, her fingers trembling slightly. The old screen flickered to life, and a single word etched itself across the glass: WELCOME.
Realistic Human Voices with Audio Tags
Expressive Multi-speaker Voice Clone
Multilingual Support (Chinese, Korean, and More)
Key Features
- Voice cloning for natural dialogue
- Multi-speaker dialogue
- Rich emotional range
- 24kHz high-fidelity audio
- Open source
- Multilingual support
- Instructable