
Boson AI Launches Higgs-Audio V2.5 Voice Model

January 9, 2026 · By Boson AI Team
Today, we are proud to launch Higgs-Audio V2.5, the latest iteration of Boson AI’s Audio model, designed to bring high-fidelity generation into production environments. Building on Higgs-Audio V2, this release combines improved efficiency with the stability required for real-world deployment.
With V2.5, we condensed the model architecture to 1B parameters while surpassing the speed and accuracy of the prior 3B model. This is achieved through a new alignment strategy using Group Relative Policy Optimization (GRPO) on our curated Voice Bank dataset, combined with improved voice cloning and finer-grained style control.

Key Improvements

Our new model benefits from four major architectural and data-centric improvements:
1. Lightweight Architecture
Higgs-Audio v2.5 introduces a more lightweight model architecture that reduces the parameter count from 3 billion to 1 billion, improving overall efficiency. The streamlined design enables faster inference and lower hardware requirements while preserving expressive voice generation; both speed and accuracy exceed those of the v2 release.
A key goal for lifelike conversations is to reduce the time it takes for the model to start producing audio. v2.5 requires only ~150 ms to generate the first token, down from ~200 ms in v2, roughly a 25% reduction in time to first token.
Speed matters when it's paired with precision: both word error rate (WER) and speaker similarity improve further in V2.5, as evaluated on SEED-TTS Eval and CV3-EVAL.
SEED-TTS Eval Benchmark

| Model | En WER ↓ | En Sim ↑ | Zh WER ↓ | Zh Sim ↑ |
|-------|----------|----------|----------|----------|
| v2.5  | 1.38     | 73.09    | 1.35     | 76.22    |
| v2.0  | 2.44     | 67.70    | 1.89     | 73.21    |
CV3-EVAL Benchmark

| Model | En WER ↓ | En Sim ↑ | Ko WER ↓ | Ko Sim ↑ | Ja WER ↓ | Ja Sim ↑ | Zh WER ↓ | Zh Sim ↑ |
|-------|----------|----------|----------|----------|----------|----------|----------|----------|
| v2.5  | 3.52     | 68.56    | 2.27     | 74.66    | 2.52     | 73.19    | 3.98     | 75.20    |
| v2.0  | 13.40    | 61.93    | 5.15     | 71.54    | 46.14    | 61.83    | 12.21    | 71.82    |

| Model | Fr WER ↓ | Fr Sim ↑ | De WER ↓ | De Sim ↑ | Es WER ↓ | Es Sim ↑ | It WER ↓ | It Sim ↑ | Ru WER ↓ | Ru Sim ↑ |
|-------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| v2.5  | 8.58     | 69.18    | 3.37     | 72.65    | 2.82     | 73.78    | 4.86     | 72.34    | 18.02    | 70.37    |
| v2.0  | 18.57    | 61.67    | 6.23     | 67.31    | 5.50     | 70.59    | 12.13    | 63.81    | 72.09    | 60.44    |
2. Curated Data and Alignment
V2.5 leverages a purpose-built Voice Bank of high-quality public speech data, cleaned in-house. On top of this data we run an alignment stage based on Group Relative Policy Optimization (GRPO). This approach prioritizes performance in EN, ZH, KO, and JA, while strengthening zero-shot generalization across other major languages such as Spanish, German, French, and Italian.
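For readers unfamiliar with GRPO, the core idea is to score a group of candidate generations for the same prompt and normalize each reward against the group's statistics. The snippet below sketches the standard group-relative advantage; the reward values are made up, and Boson AI's exact reward design and training recipe are not public.

python3 - <<'EOF'
# Standard GRPO group-relative advantage (illustrative; rewards are made up)
import statistics
rewards = [0.62, 0.80, 0.55, 0.91]   # hypothetical scores for 4 candidates
mu = statistics.mean(rewards)        # group mean
sigma = statistics.pstdev(rewards)   # group standard deviation
advantages = [(r - mu) / sigma for r in rewards]  # A_i = (r_i - mu) / sigma
print([round(a, 2) for a in advantages])
EOF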
3. Enhanced Voice Cloning
V2.5 introduces a refined pretraining objective, designed to strengthen speaker consistency. As a result, the model can now reproduce fine-grained timbre and prosodic features from shorter reference samples, improving voice cloning fidelity compared to the previous version.
4. Expressiveness Control Tags
Besides being 'naturally' talented, v2.5 adds explicit expressiveness control tags that provide direct, token-level guidance over speech style. These controls enable finer-grained prosody modulation, allowing generated speech to adapt more naturally to conversational context.
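As a minimal sketch of how such tags could be used, assuming an illustrative tag syntax (the tags below are hypothetical; consult the model documentation for the actual control-tag vocabulary):

# Hypothetical tags; see the model documentation for the supported set.
cat > transcript_tagged.txt <<'EOF'
[excited] I can't believe we pulled this off!
[whispering] Keep it between us for now.
EOF
python3 generation.py --transcript transcript_tagged.txt \
    --ref_audio belinda --seed 12345 --out_path generation.wav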

Model Details

Overall, our model is an autoregressive audio transformer (paired with an audio tokenizer) with 1B parameters, shipped in FP32 format, which leaves further room for acceleration via quantization. Primary language support (tuned via GRPO) covers English, Chinese, Korean, and Japanese. Secondary languages (zero-shot generalization) include Spanish, German, French, and Italian.
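As a back-of-the-envelope check on the quantization headroom (weights only; activations and KV cache are extra):

python3 -c "
params = 1_000_000_000  # 1B parameters, dense weights only
for name, nbytes in [('FP32', 4), ('FP16/BF16', 2), ('INT8', 1)]:
    print(f'{name}: {params * nbytes / 1e9:.0f} GB')
"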

Getting Started

Deployment options
Higgs-Audio v2.5 supports flexible deployment across both managed and self-serve environments, allowing teams to choose the path that best fits their requirements. For an interactive demo, try https://voice.boson.ai/demo
How to Use
The best way to experience the model from the command line is our generation script, generation.py, with the following options:
--transcript transcript.txt               Transcript
--out_path generation.wav                 Output file
--scene_prompt prompt.txt                 Scene prompt
--ref_audio voice1,voice2                 Voice(s) or audio profiles, such as
                                          profile:male_en_british
--seed 12345                              RNG seed
--temperature 0.3                         Sampling temperature
--chunk_method word                       Chunking used for long-form audio
--generation_chunk_buffer_size 2          Size of the buffer per chunk
--ras_win_len 0                           Used for music generation
Let's look at a few examples of this in action.

Single-speaker Audio Generation

(Shallow) Voice clone
The model will read the transcript in the same voice as the reference audio passed via --ref_audio.
python3 generation.py --transcript transcript/single_speaker/en_dl.txt \
    --ref_audio broom_salesman --seed 12345 --out_path generation.wav
We have some example audio prompts stored in voice_prompts. Feel free to pick one from the folder and try out the model. Here's another example that uses the voice of Belinda. You can also add your own favorite voice to the folder and clone it; a sketch of this follows the next command.
python3 generation.py --transcript transcript/single_speaker/en_dl.txt \
    --ref_audio belinda --seed 12345 --out_path generation.wav
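For example, to clone your own voice (assuming, as the examples above suggest, that --ref_audio names resolve to audio files in voice_prompts; the file name my_voice is a placeholder):

# Hypothetical workflow: my_voice_sample.wav and my_voice are placeholders;
# check the repo for whether a paired transcript file is also expected.
cp my_voice_sample.wav voice_prompts/my_voice.wav
python3 generation.py --transcript transcript/single_speaker/en_dl.txt \
    --ref_audio my_voice --seed 12345 --out_path generation.wav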
Cross-lingual voice clone
Voice cloning works not only within a given language but also across languages. This example demonstrates voice cloning with a Chinese reference prompt, where the synthesized speech is in English. We lowered the temperature to 0.3 to reduce variability.
python3 generation.py --transcript transcript/single_speaker/en_dl.txt \
    --scene_prompt empty --ref_audio zh_man_sichuan --temperature 0.3 \
    --seed 12345 --out_path generation.wav
Smart voice
Whenever no reference voice is provided, the model chooses a random voice suitable for the prompt (or scene). This works well in multiple languages (e.g., English and Chinese, as in the examples below).
python3 generation.py --transcript transcript/single_speaker/en_dl.txt \
    --seed 12345 --out_path generation.wav
python3 generation.py --transcript transcript/single_speaker/zh_ai.txt \
    --seed 12345 --out_path generation.wav
Speaker characteristics
You can describe the speaker characteristics in text (rather than with a voice sample). For the full set of styles, see voice_prompts/profile.yaml. For instance, the example below generates a male British voice (remove the RNG seed to sample other voices, as shown after the command).
python3 generation.py --transcript transcript/single_speaker/en_dl.txt \
    --ref_audio profile:male_en_british --seed 12345 \
    --out_path generation.wav
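Dropping the seed makes each run sample a different voice that still matches the profile description:

python3 generation.py --transcript transcript/single_speaker/en_dl.txt \
    --ref_audio profile:male_en_british --out_path generation.wav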
Chunking for long-form audio generation
For long-form audio we need to chunk the text into segments and render them one by one, which risks losing the context established by earlier segments. To prevent this, the generation script uses the previous segment as a reference for the next one. The example below generates the first five paragraphs of our Higgs-Audio V1 release text; a conceptual sketch of the loop follows the command.
python3 generation.py --scene_prompt scene_prompts/reading_blog.txt \
    --transcript transcript/single_speaker/en_higgs_audio_blog.md \
    --ref_audio en_man --chunk_method word --temperature 0.3 \
    --generation_chunk_buffer_size 2 --seed 12345 --out_path generation.wav
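Conceptually, the chunked loop behaves like the sketch below (illustrative pseudocode only, not the script's actual internals):

python3 - <<'EOF'
# Illustrative only: each chunk is generated with a rolling buffer of prior
# chunks as reference, mirroring --generation_chunk_buffer_size 2.
chunks = ["Paragraph one ...", "Paragraph two ...", "Paragraph three ..."]
buffer_size = 2
context = []
for text in chunks:
    # a real implementation would call the model here; we stub it with print
    print(f"render {text!r} conditioned on {len(context)} prior chunk(s)")
    context = (context + [text])[-buffer_size:]
EOF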

Sound Generation

The pretrained model demonstrates emergent capabilities, such as sound generation. These are included as experimental prompts (to be enhanced in future versions of Higgs-Audio).
Hum a tune with the cloned voice
python3 generation.py \
    --transcript transcript/single_speaker/experimental/en_humming.txt \
    --ref_audio en_woman --ras_win_len 0 --seed 12345 \
    --out_path generation.wav
Read the sentence while adding background music (BGM)
python3 generation.py \
    --transcript transcript/single_speaker/experimental/en_bgm.txt \
    --ref_audio en_woman --ras_win_len 0 --seed 12345 \
    --ref_audio_in_system_message --out_path generation.wav

Multi-speaker Audio Generation

Smart voice
Just as in the single-voice setting, Higgs-Audio can generate dialogues with multiple voices in a zero-shot fashion. See the transcript in transcript/multi_speaker/en_argument.txt for details; the speakers are annotated with [SPEAKER0] and [SPEAKER1] (an illustrative transcript follows the command below).
python3 generation.py --transcript transcript/multi_speaker/en_argument.txt \
    --seed 12345 --out_path generation.wav
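An illustrative transcript in that format (the dialogue itself is made up):

# The [SPEAKER0]/[SPEAKER1] annotation follows the convention described above.
cat > my_dialogue.txt <<'EOF'
[SPEAKER0] Did you finish the report?
[SPEAKER1] Almost, I just need to double-check the numbers.
[SPEAKER0] Great. Send it over when you're done.
EOF
python3 generation.py --transcript my_dialogue.txt \
    --seed 12345 --out_path generation.wav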
Multi-voice clone
Again, just as with a single speaker, you can also zero-shot clone multiple speakers. Here's an example that places the reference audios in the system message and prompts the model iteratively. You can hear "Belinda" arguing with the "Broom Salesman". As before, long generations (such as en_higgs.txt) require chunking.
python3 generation.py --transcript transcript/multi_speaker/en_argument.txt \
    --ref_audio belinda,broom_salesman --ref_audio_in_system_message \
    --chunk_method speaker --seed 12345 --out_path generation.wav
python3 generation.py --transcript transcript/multi_speaker/en_higgs.txt \
    --ref_audio broom_salesman,belinda --ref_audio_in_system_message \
    --chunk_method speaker --seed 12345 --chunk_max_num_turns 2 \
    --out_path generation.wav

Closing

Higgs-Audio V2.5 reflects Boson AI's continued commitment to building reliable, production-ready voice models. By combining a lightweight architecture with improved alignment and expressive control, V2.5 is designed to support real-world deployment at scale.

Contact us for aligned foundation models or custom AI solutions tailored to your needs.

#higgs-audio
#text-to-speech
#voice-generation
#v2.5