
Announcing Higgs Llama V2

July 15, 2024 · By Boson AI Team

At Boson AI, we are working on intelligent agents that can serve as human companions and helpers. Today we are excited to share Higgs-Llama-3-70B-v2, a new model that significantly improves upon its predecessor. It narrows the gap to the very best proprietary models on benchmarks relevant for dialog interaction and understanding.

Partnering with the roleplay community, we collected 6.2M dialogues in a two-week A/B test, which allowed us to evaluate Higgs v2 directly against other models. Compared to Claude 3.5 Sonnet, Higgs v2 reduces the response regeneration rate(1) by 21.6%. This rate matters because it directly reflects the cases where users are unhappy with the generated result. Higgs v2 also increases the day-1 retention rate(2) by 5.3%.
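
For clarity, here is a minimal sketch of how the two metrics in this comparison could be computed from interaction logs. The record fields (`regenerated`, `signup_date`, `active_dates`) are hypothetical and illustrative only, not Boson AI's actual schema.

```python
from datetime import date, timedelta

def regeneration_rate(responses):
    """Fraction of model responses the user asked to regenerate."""
    return sum(r["regenerated"] for r in responses) / len(responses)

def day1_retention(users):
    """Fraction of new users who are active again the day after they first arrive."""
    retained = sum(
        u["signup_date"] + timedelta(days=1) in u["active_dates"]
        for u in users
    )
    return retained / len(users)

# Toy example: one of two new users returns the next day.
users = [
    {"signup_date": date(2024, 7, 1), "active_dates": {date(2024, 7, 2)}},
    {"signup_date": date(2024, 7, 1), "active_dates": set()},
]
print(day1_retention(users))  # 0.5
```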

Higgs Judger

Much of the performance boost in Higgs v2 comes from an improved judging system, which guides model alignment through synthetic feedback signals. We built an in-house LLM reward model, named Higgs Judger, to evaluate model outputs. On Reward Bench, Higgs Judger ties with the best generative judger on the leaderboard, Google’s Gemini 1.5 Pro.
In addition, the judger model learns players’ preferences during roleplay from the feedback that users provide.
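
The post does not detail the pipeline, but a common pattern for judger-guided alignment is to score sampled candidate responses with the reward model and keep the best and worst as a synthetic preference pair. The sketch below illustrates that pattern; `judge_score`, `sample_responses`, and the DPO-style pairing are our assumptions, not Boson AI's actual implementation.

```python
import random

def judge_score(prompt: str, response: str) -> float:
    """Placeholder reward model; a real judger would run an LLM here."""
    return random.random()

def sample_responses(prompt: str, n: int = 4) -> list[str]:
    """Placeholder for sampling n candidate responses from the policy model."""
    return [f"candidate response {i} to: {prompt}" for i in range(n)]

def make_preference_pair(prompt: str) -> dict:
    """Score candidates with the judger and keep the best/worst as a pair.

    Such (chosen, rejected) pairs are the kind of synthetic feedback signal
    consumed by preference-based alignment methods such as DPO.
    """
    candidates = sorted(sample_responses(prompt),
                        key=lambda r: judge_score(prompt, r))
    return {"prompt": prompt, "chosen": candidates[-1], "rejected": candidates[0]}

print(make_preference_pair("Stay in character as a ship's navigator."))
```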

Performance on Reward Bench

| Model | Reward Bench score |
| --- | --- |
| Higgs Judger | 88.1 |
| Gemini 1.5 Pro (05/14) | 88.1 |
| GPT-4 Turbo (04/09) | 85.1 |
| GPT-4o | 84.7 |
| Claude 3.5 Sonnet | 83.8 |
| Claude 3 Opus | 80.7 |
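
Reward Bench scores a judger by pairwise accuracy: for each benchmark prompt, it checks whether the judger ranks the human-preferred response above the rejected one. A minimal sketch of that computation, reusing the `judge_score` interface assumed above:

```python
def reward_bench_accuracy(pairs, judge_score):
    """Fraction of pairs where the judger scores the preferred ("chosen")
    response above the dispreferred ("rejected") one."""
    correct = sum(
        judge_score(p["prompt"], p["chosen"]) > judge_score(p["prompt"], p["rejected"])
        for p in pairs
    )
    return correct / len(pairs)
```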

Performance on Arena-Hard

| Model | Arena-Hard |
| --- | --- |
| Claude 3.5 Sonnet | 79.3 |
| GPT-4o | 79.2 |
| Higgs Llama 3 70B v2 | 78.6 |
| GPT-4 Turbo (01/25) | 78.0 |
| Gemini 1.5 Pro | 72.0 |
| Claude 3 Opus | 60.4 |
| Higgs Llama 3 70B | 49.6 |
| Claude 3 Sonnet | 46.8 |
| Llama 3 70B Instruct | 41.1 |
| Mistral Large | 37.7 |

Performance on AlpacaEval 2.0

| Model | AlpacaEval 2.0 |
| --- | --- |
| GPT-4o | 57.5 |
| Higgs Llama 3 70B v2 | 56.7 |
| GPT-4 Turbo (04/09) | 55.0 |
| Claude 3.5 Sonnet | 52.4 |
| Claude 3 Opus | 40.5 |
| Higgs Llama 3 70B | 38.6 |
| Claude 3 Sonnet | 34.9 |
| Llama 3 70B Instruct | 34.4 |
| Mistral Large | 32.7 |

Performance on MMLU-Pro

| Model | MMLU-Pro |
| --- | --- |
| GPT-4o | 72.6 |
| Gemini 1.5 Pro | 69.0 |
| Claude 3 Opus | 68.5 |
| GPT-4 Turbo | 63.7 |
| Higgs Llama 3 70B | 63.2 |
| Higgs Llama 3 70B v2 | 62.8 |
| Gemini 1.5 Flash | 59.1 |
| Claude 3 Sonnet | 56.8 |
| Llama 3 70B Instruct | 56.2 |

Acknowledgments
Model: Xingjian Shi, Rand Xie, Weisu Yin
Serving: Yizhi Liu, Zach Zheng
Data / Evaluation: Yi Zhu, Jaewon Lee, Weisu Yin, Canwen Xu
Training Infrastructure: Shuai Zheng, Rand Xie
Hardware: Sergii Tiugaiev, Kells Kearney, Alex Shylo

We would like to thank our customers for their constructive feedback, and our friends at NVIDIA, Arc Compute, eStruxture, Crusoe, AWS and Scaleway for their excellent technical support.

Footnotes

  1. The rate at which a user regenerates the model’s response.
  2. The percentage of new users who return the next day.