Since our first release last April, customers have asked for better language understanding and broader language coverage, and we have shipped many of those capabilities in the agentic platform models we build for enterprise customers. Today we are publicly releasing Higgs Audio v3, a speech-to-text (STT/ASR) foundation model. It supports 94 languages, covering all major languages and many less widely spoken ones, and pairs state-of-the-art transcription with sophisticated language detection and advanced sentiment and semantic understanding.
The model combines an LLM backbone with a specialized audio encoder. It delivers industry-leading performance in Automatic Speech Recognition (ASR) and Speech Translation (AST), outperforming Whisper-large-v3 by a large margin on key languages such as English, Spanish, and French.

Table 1 - English Benchmarks (WER %, lower is better)
| Model | LibriSpeech test-clean | LibriSpeech test-other | Common Voice 15 (en) |
|---|---|---|---|
| Higgs Audio v3 STT | 1.55 | 3.20 | 6.80 |
| Qwen3-ASR-1.7B | 1.62 | 3.40 | 7.40 |
| Whisper-large-v3 | 2.10 | 4.26 | 10.17 |
Sources: LibriSpeech, Common Voice
Key Improvements
Higgs Audio v3 STT introduces significant architectural and data-centric advancements. Continuing to expand the high-quality data in our voice bank, we leveraged over 1 million hours of labeled audio to train the Higgs Audio understanding models.
Next Generation Architecture
The model integrates an LLM backbone coupled with a specialized encoder, a combination that provides superior semantic understanding and acoustic robustness. Higgs Audio v3 STT outperforms popular state-of-the-art models across multiple languages.
Multilingual ASR Performance
CommonVoice-17 (WER %, up to 1,600 samples per language)
| Model | En | Es | De | Fr | Ru | Ko | Ja |
|---|---|---|---|---|---|---|---|
| Higgs Audio v3 STT | 7.29 | 3.03 | 4.33 | 6.85 | 3.68 | 3.50 | 14.79 |
| Whisper-large-v3 | 11.33 | 4.49 | 6.12 | 11.29 | 5.43 | 2.65 | 15.03 |
Source: Mozilla Common Voice
FLEURS (WER %, up to 1,600 samples per language)
| Model | En | Es | De | Fr | Ru | Ko | Ja |
|---|---|---|---|---|---|---|---|
| Higgs Audio v3 STT | 4.68 | 3.26 | 5.10 | 5.57 | 5.12 | 1.37 | 2.85 |
| Whisper-large-v3 | 5.94 | 4.23 | 5.92 | 7.02 | 5.86 | 1.45 | 3.81 |
Source: Google FLEURS
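For context on how to read these tables, the scores are word error rates (WER), the standard ASR metric: the number of word-level substitutions, insertions, and deletions divided by the reference length (lower is better). A minimal sketch of the computation, not the official scoring script, looks like this:

```python
# Minimal word error rate (WER) computation using Levenshtein distance
# over word tokens. Illustrative only; benchmark pipelines also apply
# text normalization (casing, punctuation) before scoring.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table for edit distance between the two word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Dropping one word from a six-word reference, for example, yields a WER of 1/6, i.e. roughly 16.7%.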
Low Latency Streaming
We implemented architectural changes to support chunk-prefill, which drastically cuts response latency compared with traditional full-sequence processing and makes the model well suited to real-time applications.
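As a client-side sketch of why chunked processing helps: if audio is submitted in fixed-duration chunks, the server can prefill and start decoding chunk 0 while later chunks are still being captured, rather than waiting for the full recording. The chunk size and audio format below are illustrative assumptions, not the documented streaming protocol:

```python
# Split raw 16-bit mono PCM into fixed-duration chunks suitable for
# incremental submission. Chunk duration (320 ms) and format (16 kHz,
# 16-bit mono) are illustrative choices, not API requirements.

def chunk_pcm(samples: bytes, sample_rate: int = 16000,
              chunk_ms: int = 320, bytes_per_sample: int = 2):
    """Yield successive chunks of `chunk_ms` milliseconds of raw PCM."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    for start in range(0, len(samples), chunk_bytes):
        yield samples[start:start + chunk_bytes]

# One second of 16 kHz / 16-bit mono audio (32,000 bytes) splits into
# three full 320 ms chunks plus a 40 ms remainder.
chunks = list(chunk_pcm(b"\x00" * 32000))
```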
Dynamic Language Adaptation
The model supports ASR both with and without language hints. In "no hint" mode, it dynamically adapts to the input language, enabling seamless transcription of multilingual audio without prior configuration — a critical feature for production environments.
Robustness in Challenging Environments
Through advanced data augmentation techniques, Higgs Audio v3 STT demonstrates exceptional robustness in challenging acoustic environments, maintaining high accuracy even with background noise or poor recording quality.
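To illustrate the kind of augmentation involved (a sketch, not Boson AI's actual training pipeline), a common technique is additive-noise augmentation: scale a background-noise signal so the speech-to-noise power ratio hits a target SNR, then mix the two:

```python
import math

# Additive-noise augmentation sketch: mix noise into clean speech at a
# chosen signal-to-noise ratio (SNR) in dB. Real pipelines typically add
# reverberation, speed perturbation, and codec simulation as well.

def mix_at_snr(speech, noise, snr_db):
    """Return speech with noise mixed in at `snr_db` dB SNR."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Choose scale so that p_speech / (scale**2 * p_noise) == 10**(snr_db / 10).
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

At 0 dB SNR the scaled noise carries the same average power as the speech, a deliberately harsh condition for stress-testing robustness.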
Model Details
Model Type: Multimodal Audio Understanding (STT/AST)
Developed by: Boson AI
Tensor Type: BF16
Parameters: 2.07B
Tasks: ASR (Automatic Speech Recognition), AST (Speech Translation)
Language Support: 94 languages
Primary Languages
English, Chinese (Simplified), Spanish, German, French, Russian, Korean, Japanese
Extended Language Support
Cantonese, Abkhazian, Afrikaans, Amharic, Arabic, Assamese, Azeri, Bashkir, Belarusian, Bulgarian, Bengali, Breton, Catalan, Czech, Chuvash, Welsh, Danish, Divehi, Greek, Esperanto, Estonian, Basque, Farsi, Finnish, Frisian, Irish, Galician, Guarani, Hausa, Hebrew, Hindi, Hungarian, Armenian, Interlingua, Indonesian, Igbo, Italian, Georgian, Kazakh, Kirghiz, Ganda, Laothian, Lithuanian, Latvian, Macedonian, Malayalam, Mongolian, Marathi, Maltese, Nepali, Dutch, Norwegian (Nynorsk), Occitan, Oriya, Ossetian, Punjabi, Polish, Pashto, Portuguese, Rhaeto-Romanic, Romanian, Kinyarwanda, Sardinian, Slovak, Slovenian, Albanian, Serbian, Swedish, Swahili, Tamil, Telugu, Thai, Tigrinya, Turkmen, Turkish, Tatar, Twi, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Yiddish, Yoruba, Asturian, Central Kurdish (Sorani)
Get Started
Sample Code
Transcribe an audio file:
```python
import requests
from pathlib import Path

# Base URL of the Higgs Audio v3 STT service (adjust to your deployment).
ASR_URL = "http://localhost:8000"

def transcribe_file(audio_path: str, language: str = "English") -> dict:
    """Transcribe a single audio file.

    Args:
        audio_path: Path to an audio file (WAV, MP3, FLAC, etc.)
        language: Target language for transcription.
            Use "auto" to let the model detect the language.

    Returns:
        dict with keys: transcription, language, audio_duration_seconds,
        num_segments, processing_time_seconds, realtime_factor
    """
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{ASR_URL}/transcribe",
            files={"file": (Path(audio_path).name, f)},
            data={"language": language},
        )
    resp.raise_for_status()
    return resp.json()
```
Transcribe with auto language detection:
```python
def transcribe_auto_language(audio_path: str) -> dict:
    """Transcribe without specifying a language; the model will detect it."""
    return transcribe_file(audio_path, language="auto")
```
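A small convenience helper for working with the result, assuming only the response fields listed in the `transcribe_file` docstring above (this is a sketch, not part of an official client):

```python
# Format a transcription response dict into a one-line summary.
# Field names follow the transcribe_file docstring: transcription,
# language, audio_duration_seconds, realtime_factor.

def summarize(result: dict) -> str:
    """One-line summary of a transcription response."""
    return (f"[{result['language']}] {result['transcription']} "
            f"({result['audio_duration_seconds']:.1f}s audio, "
            f"RTF {result['realtime_factor']:.2f})")
```

For example, `summarize(transcribe_file("meeting.wav"))` would print the detected language, the transcript, and the real-time factor in one line.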
Intended Use Cases
Inference Directly
- High-Accuracy Transcription: Converting speech to text for meetings, lectures, and interviews with performance exceeding current state-of-the-art models.
- Real-time Captioning: Leverage the streaming/chunk-prefill capability to provide low-latency captions for live broadcasts or streams.
- Dynamic Language Processing: Transcribing audio streams with unknown or mixed languages using the model's dynamic multilingual adaptation capabilities.
Downstream Applications
- Voice Assistants: These models power the understanding layer of intelligent agents that need to process user queries with low latency.
- Content Indexing: Automatically generating metadata and searchable text from large audio/video archives.
- Accessibility Services: Providing real-time text alternatives for hearing-impaired users.
Out-of-Scope & Prohibited Use
This model is NOT intended for:
- Non-consensual Surveillance: Recording and transcribing conversations without the knowledge or consent of the participants.
- High-Stakes Decision Making: Relying solely on automated transcription for critical legal or medical records without human verification, as errors may still occur.
Closing
Higgs Audio v3 reflects Boson AI's continued investment in building reliable, production-ready voice models, and marks a significant step in integrating generative AI into our enterprise agentic platform. As a state-of-the-art speech-to-text (ASR) foundation model, it demonstrates our commitment to both cutting-edge model innovation and a customer-focused product philosophy.
Beyond transcription accuracy, the model delivers robust language detection, advanced sentiment analysis, and deep semantic understanding of spoken interactions. Together, these capabilities represent a significant advance in enterprise-grade Speech-to-Text performance and enable the next generation of intelligent voice-driven systems.
Acknowledgments
Model: Dongming Shen
Training: Dongming Shen, Gee Yang Tay
Data: Silin Meng, Gee Yang Tay, Yuzhi Tang, Dongming Shen
Evaluation: Ahmad Salimi, Wentao Ma, Ke Bai, Yi Zhu
Inference: Yizhi Liu, Abdulrahman Abdulrazzag, Xiaoyi Li, Weisu Yin
Lead: Alex Smola, Mu Li, Lindsey Allen
#higgs-audio
#speech-to-text
#asr
#v3
#multilingual