
ProactBench: Beyond what the user asked for

Sepehr Harfi, Ahmad Salimi, Dongming Shen, Alex Smola
May 28, 2026
ProactBench is a benchmark for conversational proactivity: noticing and acting on the needs a user has implied but never said. It decomposes proactivity into three phase-tied trigger types, evaluates 16 frontier and open-weight models, and shows that the hardest of them — Recovery — barely correlates with existing capability benchmarks.

What is ProactBench?

Most LLM benchmarks score how well a model responds to explicit requests. ProactBench measures a different conversational ability: noticing and acting on needs the user has implied but not said. We call this conversational proactivity.
198 curated dialogues · 624 trigger points · 24 communication styles

Three phase-tied trigger types

ProactBench decomposes proactivity into three behaviours pinned to a conversation's phase:
Emergent (single-anchor inference; turns 1–3 · high bar): Infer an unstated need from a single disclosed detail early in the conversation.
Critical (multi-anchor synthesis; turns 4–7 · high bar): Synthesise two or more disclosed details into a new conclusion the user has not articulated.
Recovery (post-completion value; turns 8–10 · moderate bar): Offer grounded forward-looking value tied to a specific earlier detail, after the user signals task completion.

24 communication styles from a psychometric inventory

A proactivity benchmark also has to vary how users speak, not just what they want. ProactBench draws user behaviour from the Communication Styles Inventory (CSI) of de Vries et al. (2011), a six-dimensional instrument from organisational psychology. Each dimension is treated as a binary trait, giving 2⁶ = 64 combinations; we keep the 24 psychologically coherent profiles and discard combinations that are internally contradictory (e.g. simultaneously precise, expressive, and emotionally volatile).
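The 2⁶ = 64 count can be checked by enumerating every binary assignment of the six traits. The sketch below is illustrative only: the `is_coherent` rule is a hypothetical placeholder that rejects just the one contradictory combination named above (mapping "emotionally volatile" to Emotionality is an assumption), so it keeps far more than the 24 profiles the benchmark retains. The authors' actual coherence criteria are not given here.

```python
from itertools import product

# Trait letters follow the six CSI dimensions used in the article.
TRAITS = ("E", "P", "V", "Q", "M", "I")


def all_profiles():
    """Return every binary assignment of the six traits as a dict."""
    return [
        dict(zip(TRAITS, bits))
        for bits in product((False, True), repeat=len(TRAITS))
    ]


def is_coherent(profile):
    # Hypothetical placeholder, NOT the authors' filter: reject only the
    # example contradiction (precise, expressive, and emotional at once).
    return not (profile["P"] and profile["E"] and profile["M"])


profiles = all_profiles()
kept = [p for p in profiles if is_coherent(p)]
print(len(profiles))  # 64
print(len(kept))      # 56 under the placeholder rule, not the benchmark's 24
```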
E · Expressiveness: Dominates conversations, uses humour, speaks at length.
P · Preciseness: Structures messages logically, weighs words, avoids trivialities.
V · Verbal aggressiveness: Reacts irritably, makes demands, confronts.
Q · Questioningness: Challenges, probes motives, introduces tangents.
M · Emotionality: Displays visible emotion, stress, vulnerability.
I · Impression manipulativeness: Uses flattery, charm, selective self-disclosure.
14 terse styles: 5–25 words per message
10 chatty styles: 40–100 words per message
Planner blind to style: rubrics judged without persona / tone
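The two length bands lend themselves to a simple automated check. The following is a minimal sketch, assuming words are counted by splitting on whitespace and that both bounds are inclusive; neither detail is specified in the article, and the function and dictionary names are hypothetical.

```python
# Word-count bounds per style family, taken from the article.
LENGTH_BOUNDS = {"terse": (5, 25), "chatty": (40, 100)}


def within_length(message: str, family: str) -> bool:
    """Return True if the whitespace-delimited word count fits the family's band."""
    low, high = LENGTH_BOUNDS[family]
    return low <= len(message.split()) <= high


print(within_length("Need a flight to Berlin tomorrow", "terse"))  # True (6 words)
print(within_length("Hi", "terse"))                                # False (1 word)
```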

Three-agent evaluation architecture

A proactivity benchmark must do more than score a static answer: it has to create the conversation, decide when an unstated need has become testable, and score the response without letting the assistant model see the test. ProactBench uses three agents with carefully restricted information access, plus an offline judge.
Figure: The ProactBench three-agent loop. A Planner authors strategy and writes the rubric for turn t+1 before the assistant answers. A User Agent renders the Planner's tactical orders as natural messages under a persona and communication style. The Assistant Model sees only the natural conversation, never the rubric, blueprint, or trigger schedule. An offline judge rescores trigger turns under identical contexts across all 16 models.
Planner
Authors the dialogue strategy: which anchors to reveal, when to declare a trigger, and what rubric should govern that trigger. Trigger declarations are prospective — the rubric is committed before the assistant answers.
Sees: persona, scenario, dialogue history.
Blind to: communication style.
User Agent
Turns the Planner's tactical orders into natural user messages under an assigned persona and communication style. Enforces hard length constraints (5–25 words for terse styles, 40–100 for chatty).
Sees: persona, communication style.
Blind to: blueprint, scenario, rubrics.
Assistant Model
Receives only the ordinary conversation: user messages and its own prior responses. Never told that proactivity is being tested.
Sees: conversation history only.
Blind to: trigger schedule, rubric, persona, style.
Offline Judge
For model comparison, reruns only the trigger turns of an existing dialogue. Holds context and rubric identical across all 16 evaluated models.
Sees: history, fixed rubric, regenerated response.
Blind to: persona, style.
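One way to picture these information asymmetries is as a per-agent allow-list applied to a shared dialogue state. The sketch below transcribes the "Sees" lines into such a map; the key names, the `build_view` helper, and the example state are hypothetical and are not the authors' implementation.

```python
# Fields each agent may see, transcribed from the "Sees" lines above.
# Key names are illustrative assumptions.
VISIBLE = {
    "planner": {"persona", "scenario", "history"},
    "user_agent": {"persona", "style"},
    "assistant": {"history"},
    "offline_judge": {"history", "rubric", "response"},
}


def build_view(agent: str, state: dict) -> dict:
    """Return only the fields of `state` that `agent` is allowed to see."""
    allowed = VISIBLE[agent]
    return {key: value for key, value in state.items() if key in allowed}


# Hypothetical dialogue state with placeholder values.
state = {
    "persona": "placeholder persona",
    "scenario": "placeholder scenario",
    "style": "placeholder style",
    "history": ["user: ...", "assistant: ..."],
    "rubric": "placeholder rubric",
    "response": "placeholder response",
}

print(sorted(build_view("assistant", state)))  # ['history']
```

Filtering by allow-list rather than deny-list means a newly added field stays hidden from every agent until it is explicitly granted.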

Four validity threats addressed by design

T1 · Style-confounded scoring.
If scoring depended on user style, proactive content could be mistaken for persona-matching tone. The Planner is blind to style at dialogue time, and the offline judge receives only rubric and history.
T2 · Rubric leakage.
If the assistant saw the rubric or trigger schedule, it could game the benchmark. It sees only natural conversation history.
T3 · External context.
Cold-start isolation: during curation the assistant has no access to persona, scenario, blueprint, or rubric, so the curated dialogues contain only information a real assistant would naturally have.
T4 · Information dumps.
An anchor-drip rule of at most one primary anchor per user turn keeps each trigger probing inference from a controlled information state, not multi-document QA.

Scoring

Triggers are scored Pass, Partial, or Fail. Pass requires the proactive principle to be met: single-anchor inference for Emergent, multi-anchor synthesis for Critical, grounded forward value for Recovery. Partial captures narrower but substantive contributions. Fail covers reactive execution, generic closings, hallucinations, and ignored anchors. We report both pass rate and weighted score (Pass = 1, Partial = 0.5, Fail = 0). Every score must include a rationale and a verbatim evidence quote, giving an auditable trail rather than an unfalsifiable holistic judgement.
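The two reported metrics can be computed from a list of per-trigger labels. This is a minimal sketch that assumes the weighted score is the mean weight over all scored triggers; the article gives the weights but not the normalisation, and the label strings are hypothetical.

```python
# Weights from the article: Pass = 1, Partial = 0.5, Fail = 0.
WEIGHTS = {"pass": 1.0, "partial": 0.5, "fail": 0.0}


def pass_rate(labels):
    """Fraction of triggers labelled Pass."""
    return sum(label == "pass" for label in labels) / len(labels)


def weighted_score(labels):
    """Mean weight per trigger (assumed normalisation)."""
    return sum(WEIGHTS[label] for label in labels) / len(labels)


# Illustrative labels, not benchmark data.
labels = ["pass", "partial", "fail", "pass"]
print(pass_rate(labels))       # 0.5
print(weighted_score(labels))  # 0.625
```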

Results

Figure: Per-model pass rate by trigger type. Recovery drops sharply across all 16 models: the best Recovery pass rate (GPT-5.5, 37.2%) is below the worst Emergent pass rate among frontier models.
Recovery is hard, and decorrelates from existing benchmarks.
Across 16 frontier and open-weight models, Recovery is both difficult and only weakly predicted by six standard capability benchmarks (GPQA, MMLU, IFEval, LiveCodeBench, SWE-bench, AIME). Existing benchmarks intercorrelate at r = 0.64 to 0.97; Recovery's mean correlation with them is r = 0.51 (95% CI [0.29, 0.71]).
Figure: Correlation heatmap. Pairwise Pearson correlations across six standard benchmarks and three proactivity trigger types, computed over 16 models. Recovery (last row/column) breaks the existing-benchmark band.
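A mean correlation of this kind can be sketched as follows: compute the Pearson r between one trigger type's per-model scores and each benchmark's per-model scores, then average. The code assumes a plain arithmetic mean of r values (the article does not say whether, for example, a Fisher transform was used), and the numbers are toy values chosen for easy checking, not benchmark results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_correlation(target, benchmarks):
    """Average Pearson r between `target` and each benchmark's scores."""
    rs = [pearson(target, scores) for scores in benchmarks.values()]
    return sum(rs) / len(rs)


# Toy per-model scores for four hypothetical models (not real data).
target = [1, 2, 3, 4]
benchmarks = {"bench_a": [2, 4, 6, 8], "bench_b": [4, 3, 2, 1]}
print(mean_correlation(target, benchmarks))  # 0.0 (r = 1 and r = -1 average out)
```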
Recovery: a policy difference, not just capability
Figure: Logit lines. Logit-transformed per-(model, style) pass rates regressed against a shared style-difficulty axis. On Recovery, GPT-5.5's slope is 0.84, below the cross-model mean, even though it holds the highest absolute pass rate. This comparatively flat slope at the top of the leaderboard indicates a policy that is uniform across easy and hard styles.
The Recovery gap appears to reflect a policy difference, not only a capability difference: strong competitors concentrate their advantage on easier styles and lose it under more hostile user behaviour, while GPT-5.5's lead holds across both. We conjecture that post-training data mix and user-session feedback drive this difference.
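The slope analysis can be illustrated with a small sketch. It assumes the shared style-difficulty axis is the cross-model mean logit for each style and that slopes come from ordinary least squares; the article does not specify either choice, and the model names and pass rates below are toy values.

```python
from math import log


def logit(p, eps=1e-6):
    """Log-odds of a pass rate, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return log(p / (1 - p))


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def style_slopes(pass_rates):
    """pass_rates maps model -> pass rate per style, styles in a shared order."""
    logits = {m: [logit(p) for p in rates] for m, rates in pass_rates.items()}
    n_styles = len(next(iter(logits.values())))
    # Assumed axis: mean logit across models for each style.
    axis = [
        sum(vals[i] for vals in logits.values()) / len(logits)
        for i in range(n_styles)
    ]
    return {m: ols_slope(axis, vals) for m, vals in logits.items()}


# Toy pass rates for two hypothetical models over three styles.
slopes = style_slopes({"model_a": [0.2, 0.3, 0.5], "model_b": [0.1, 0.3, 0.6]})
print(slopes)
```

Under this assumed axis the slopes average to 1 across models by construction, so a slope below 1 means a model's log-odds of passing vary less across styles than the average model's.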
Humans prefer the proactive behaviour ProactBench targets.
In a pre-registered pairwise human study, raters preferred rubric-conditioned Recovery responses over vanilla generations from the same model in 80% of 144 non-tie comparisons (95% CI [74%, 86%], p < 10⁻¹²). Krippendorff's α between human raters and the GPT-5.4 judge is 0.69.

Abstract

Most LLM benchmarks score how well a model responds to explicit requests. They leave unmeasured a different conversational ability: noticing and acting on needs the user has implied but not said. We call this conversational proactivity. ProactBench decomposes it into three phase-tied types: Emergent, inference from a single disclosed anchor; Critical, synthesis across multiple anchors; and Recovery, grounded forward-looking value after task completion.
We operationalise the benchmark with three agents: a Planner, a User Agent, and an Assistant Model. Their information asymmetries defend against style-confounded scoring, rubric leakage, external-context contamination, and information dumps. The released corpus contains 198 curated dialogues with 624 trigger points across 24 communication styles drawn from a psychometric inventory and audited by an independent LLM judge. Across 16 frontier and open-weight models, Recovery is both difficult and weakly predicted by six standard benchmarks, making it a useful new evaluation signal.

Citation

If you use ProactBench in your research, please cite:
@misc{harfi2026proactbenchuserasked,
      title={ProactBench: Beyond What The User Asked For},
      author={Sepehr Harfi and Ahmad Salimi and Dongming Shen and Alex Smola},
      year={2026},
      eprint={2605.09228},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.09228},
}