Since our first release last April, customers have asked for better language understanding and broader language coverage, and we have shipped many of those capabilities in the agentic platform models we build for enterprise customers. Today we are publicly releasing Higgs Audio v3, a speech-to-text (STT/ASR) foundation model. It supports 94 languages, covering all major languages and many less widely spoken ones, and pairs state-of-the-art transcription with sophisticated language detection and advanced sentiment and semantic understanding.
The model combines an LLM backbone with a specialized audio encoder. It delivers industry-leading performance in Automatic Speech Recognition (ASR) and Speech Translation (AST), outperforming Whisper-large-v3 by a large margin on key languages such as English, Spanish, and French.

Table 1 - English Benchmarks (WER %, lower is better)
| Model | LibriSpeech test-clean | LibriSpeech test-other | Common Voice 15 (en) |
|---|---|---|---|
| Higgs Audio v3 STT | 1.55 | 3.20 | 6.80 |
| Qwen3-ASR-1.7B | 1.62 | 3.40 | 7.40 |
| Whisper-large-v3 | 2.10 | 4.26 | 10.17 |
Sources: LibriSpeech, Common Voice
Key Improvements
Higgs Audio v3 STT introduces significant architectural and data-centric advancements. Continuing to expand the high-quality data in our voice bank, we leveraged over 1 million hours of labeled audio to train the Higgs Audio understanding models.
Next Generation Architecture
The model integrates an LLM backbone coupled with a specialized encoder, a combination that provides superior semantic understanding and acoustic robustness. Higgs Audio v3 STT outperforms popular state-of-the-art models across multiple languages.
Multilingual ASR Performance
CommonVoice-17 (WER %, up to 1,600 samples per language)
| Model | En | Es | De | Fr | Ru | Ko | Ja |
|---|---|---|---|---|---|---|---|
| Higgs Audio v3 STT | 7.29 | 3.03 | 4.33 | 6.85 | 3.68 | 3.50 | 14.79 |
| Whisper-large-v3 | 11.33 | 4.49 | 6.12 | 11.29 | 5.43 | 2.65 | 15.03 |
Source: Mozilla Common Voice
FLEURS (WER %, up to 1,600 samples per language)
| Model | En | Es | De | Fr | Ru | Ko | Ja |
|---|---|---|---|---|---|---|---|
| Higgs Audio v3 STT | 4.68 | 3.26 | 5.10 | 5.57 | 5.12 | 1.37 | 2.85 |
| Whisper-large-v3 | 5.94 | 4.23 | 5.92 | 7.02 | 5.86 | 1.45 | 3.81 |
Source: Google FLEURS
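For context on how to read these tables, the scores are word error rates (WER), the standard ASR metric: the number of word-level substitutions, insertions, and deletions divided by the reference length (lower is better). A minimal sketch of the computation, not the official scoring script, looks like this:

```python
# Minimal word error rate (WER) computation using Levenshtein distance
# over word tokens. Illustrative only; benchmark pipelines also apply
# text normalization (casing, punctuation) before scoring.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table for edit distance between the two word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Dropping one word from a six-word reference, for example, yields a WER of 1/6, i.e. roughly 16.7%.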
Low Latency Streaming
We implemented architectural changes to support chunk-prefill, which drastically cuts response latency compared with traditional full-sequence processing and makes the model well suited to real-time applications.
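As a client-side sketch of why chunked processing helps: if audio is submitted in fixed-duration chunks, the server can prefill and start decoding chunk 0 while later chunks are still being captured, rather than waiting for the full recording. The chunk size and audio format below are illustrative assumptions, not the documented streaming protocol:

```python
# Split raw 16-bit mono PCM into fixed-duration chunks suitable for
# incremental submission. Chunk duration (320 ms) and format (16 kHz,
# 16-bit mono) are illustrative choices, not API requirements.

def chunk_pcm(samples: bytes, sample_rate: int = 16000,
              chunk_ms: int = 320, bytes_per_sample: int = 2):
    """Yield successive chunks of `chunk_ms` milliseconds of raw PCM."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    for start in range(0, len(samples), chunk_bytes):
        yield samples[start:start + chunk_bytes]

# One second of 16 kHz / 16-bit mono audio (32,000 bytes) splits into
# three full 320 ms chunks plus a 40 ms remainder.
chunks = list(chunk_pcm(b"\x00" * 32000))
```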
Dynamic Language Adaptation
The model supports ASR both with and without language hints. In "no hint" mode, it dynamically adapts to the input language, enabling seamless transcription of multilingual audio without prior configuration — a critical feature for production environments.
Robustness in Challenging Environments
Through advanced data augmentation techniques, Higgs Audio v3 STT demonstrates exceptional robustness in challenging acoustic environments, maintaining high accuracy even with background noise or poor recording quality.
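To illustrate the kind of augmentation involved (a sketch, not Boson AI's actual training pipeline), a common technique is additive-noise augmentation: scale a background-noise signal so the speech-to-noise power ratio hits a target SNR, then mix the two:

```python
import math

# Additive-noise augmentation sketch: mix noise into clean speech at a
# chosen signal-to-noise ratio (SNR) in dB. Real pipelines typically add
# reverberation, speed perturbation, and codec simulation as well.

def mix_at_snr(speech, noise, snr_db):
    """Return speech with noise mixed in at `snr_db` dB SNR."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Choose scale so that p_speech / (scale**2 * p_noise) == 10**(snr_db / 10).
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

At 0 dB SNR the scaled noise carries the same average power as the speech, a deliberately harsh condition for stress-testing robustness.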
Model Details
Model Type: Multimodal Audio Understanding (STT/AST)
Developed by: Boson AI
Tensor Type: BF16
Parameters: 2.07B
Tasks: ASR (Automatic Speech Recognition), AST (Speech Translation)
Language Support: 94 languages
Primary Languages
English, Chinese (Simplified), Spanish, German, French, Russian, Korean, Japanese
Extended Language Support
Cantonese, Abkhazian, Afrikaans, Amharic, Arabic, Assamese, Azeri, Bashkir, Belarusian, Bulgarian, Bengali, Breton, Catalan, Czech, Chuvash, Welsh, Danish, Divehi, Greek, Esperanto, Estonian, Basque, Farsi, Finnish, Frisian, Irish, Galician, Guarani, Hausa, Hebrew, Hindi, Hungarian, Armenian, Interlingua, Indonesian, Igbo, Italian, Georgian, Kazakh, Kirghiz, Ganda, Laothian, Lithuanian, Latvian, Macedonian, Malayalam, Mongolian, Marathi, Maltese, Nepali, Dutch, Norwegian (Nynorsk), Occitan, Oriya, Ossetian, Punjabi, Polish, Pashto, Portuguese, Rhaeto-Romanic, Romanian, Kinyarwanda, Sardinian, Slovak, Slovenian, Albanian, Serbian, Swedish, Swahili, Tamil, Telugu, Thai, Tigrinya, Turkmen, Turkish, Tatar, Twi, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Yiddish, Yoruba, Asturian, Central Kurdish (Sorani)
Get Started
Sample Code
Transcribe an audio file:
```python
import requests
from pathlib import Path

# Base URL of the Higgs Audio v3 STT service (adjust to your deployment).
ASR_URL = "http://localhost:8000"

def transcribe_file(audio_path: str, language: str = "English") -> dict:
    """Transcribe a single audio file.

    Args:
        audio_path: Path to an audio file (WAV, MP3, FLAC, etc.)
        language: Target language for transcription.
            Use "auto" to let the model detect the language.

    Returns:
        dict with keys: transcription, language, audio_duration_seconds,
        num_segments, processing_time_seconds, realtime_factor
    """
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{ASR_URL}/transcribe",
            files={"file": (Path(audio_path).name, f)},
            data={"language": language},
        )
    resp.raise_for_status()
    return resp.json()
```
Transcribe with auto language detection:
```python
def transcribe_auto_language(audio_path: str) -> dict:
    """Transcribe without specifying a language; the model will detect it."""
    return transcribe_file(audio_path, language="auto")
```
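A small convenience helper for working with the result, assuming only the response fields listed in the `transcribe_file` docstring above (this is a sketch, not part of an official client):

```python
# Format a transcription response dict into a one-line summary.
# Field names follow the transcribe_file docstring: transcription,
# language, audio_duration_seconds, realtime_factor.

def summarize(result: dict) -> str:
    """One-line summary of a transcription response."""
    return (f"[{result['language']}] {result['transcription']} "
            f"({result['audio_duration_seconds']:.1f}s audio, "
            f"RTF {result['realtime_factor']:.2f})")
```

For example, `summarize(transcribe_file("meeting.wav"))` would print the detected language, the transcript, and the real-time factor in one line.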
Intended Use Cases
Inference Directly
- High-Accuracy Transcription: Converting speech to text for meetings, lectures, and interviews with performance exceeding current state-of-the-art models.
- Real-time Captioning: Leverage the streaming/chunk-prefill capability to provide low-latency captions for live broadcasts or streams.
- Dynamic Language Processing: Transcribing audio streams with unknown or mixed languages using the model's dynamic multilingual adaptation capabilities.
Downstream Applications
- Voice Assistants: These models power the understanding layer of intelligent agents that need to process user queries with low latency.
- Content Indexing: Automatically generating metadata and searchable text from large audio/video archives.
- Accessibility Services: Providing real-time text alternatives for hearing-impaired users.
Out-of-Scope & Prohibited Use
This model is NOT intended for:
- Non-consensual Surveillance: Recording and transcribing conversations without the knowledge or consent of the participants.
- High-Stakes Decision Making: Relying solely on automated transcription for critical legal or medical records without human verification, as errors may still occur.
Closing
Higgs Audio v3 reflects Boson AI's continued investment in building reliable, production-ready voice models, and marks a significant step in integrating generative AI into our enterprise agentic platform. As a state-of-the-art speech-to-text (ASR) foundation model, it demonstrates our commitment to both cutting-edge model innovation and a customer-focused product philosophy.
Beyond transcription accuracy, the model delivers robust language detection, advanced sentiment analysis, and deep semantic understanding of spoken interactions. Together, these capabilities represent a significant advance in enterprise-grade Speech-to-Text performance and enable the next generation of intelligent voice-driven systems.
Acknowledgments
Model: Dongming Shen
Training: Dongming Shen, Gee Yang Tay
Data: Silin Meng, Gee Yang Tay, Yuzhi Tang, Dongming Shen
Evaluation: Ahmad Salimi, Wentao Ma, Ke Bai, Yi Zhu
Inference: Yizhi Liu, Abdulrahman Abdulrazzag, Xiaoyi Li, Weisu Yin
Lead: Alex Smola, Mu Li, Lindsey Allen
#higgs-audio
#speech-to-text
#asr
#v3
#multilingual