
Council Deliberation vs Single Models: What Our Benchmarks Actually Show

Saad Kadri · April 1, 2026 · 7 min read

Most multi-agent products advertise benchmark gains. We ran our own — and the raw numbers are not yet representative of what either single models or council deliberation can do. The reason is mundane: our answer-checker is too strict. This post lays out what we ran in April 2026, what broke, what the cost was, and the published research baselines we measure ourselves against in the meantime.

What we actually ran

Three benchmarks against a 3-model council (GPT-5.4, Claude Sonnet 4.6, Gemini 3 Flash) in council mode (3 rounds), with single-model GPT-5.4 as the baseline:

Benchmark  | N   | Raw single | Raw deliberation | API cost (single / debate) | Status
MMLU       | 200 | 2%         | 2%               | $0.03 / $9.58              | checker too strict
TruthfulQA | 100 | 27%        | 19%              | $0.01 / $4.69              | checker + API errors
HumanEval  | 50  | 0%         | 0%               | $0.01 / $3.00              | checker too strict

Total spend: $17.30 for 350 questions. We're publishing the raw numbers because pretending we have headline gains right now would be dishonest.

Why the scores are not representative

Our answer-checker uses exact string matching. That works for carefully formatted multiple-choice answers and breaks immediately on:

  • Free-text TruthfulQA answers: a model can produce a factually correct response that doesn't contain the reference string verbatim, and the checker scores it zero. Council deliberation produces longer, more carefully worded answers, which actually hurts the raw score under exact match. (That's why deliberation scored 19% against the single model's 27% on TruthfulQA; it's not a real regression.)
  • HumanEval code: we should be running the unit tests that ship with each problem, not string-matching the function body. The fix is straightforward but it's not done yet.
  • MMLU at 2%: the checker is failing to extract the answer letter from longer responses. A 2% absolute score on a 4-choice benchmark is a checker bug, not a model bug. (A sketch of the letter-extraction fix follows this list.)
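To make the MMLU failure concrete: the fix is to extract the answer letter before comparing, rather than exact-matching the whole response string. Here is a minimal sketch of that extraction; the function name and patterns are illustrative, not the Consilium checker.

import re

def extract_choice(response: str) -> str | None:
    """Pull the answer letter out of a long, deliberated response.

    A council rarely replies with a bare "C"; it says something like
    "After discussion we agree the answer is (C) because...". Exact string
    matching against "C" scores that as wrong; extracting the letter does not.
    """
    # Prefer an explicit "answer is X" pattern.
    m = re.search(r"answer\s*(?:is|:)?\s*\(?([ABCD])\)?", response, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Naive fallback: first standalone capital A-D anywhere in the text.
    m = re.search(r"\b([ABCD])\b", response)
    return m.group(1) if m else None

For example, extract_choice("After deliberation, the council's answer is (C).") returns "C", while an exact comparison of that string against "C" returns False, which is exactly the failure mode behind the 2% score.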

Some TruthfulQA runs also hit OpenAI rate limits during execution, which dropped a fraction of debates entirely. We haven't separated checker noise from rate-limit noise yet.
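The fix for that part is mechanical: retry the individual question instead of losing the whole debate. A minimal sketch of the retry we plan to add; the exception type and backoff schedule here are assumptions, not what the runner currently does.

import random
import time

def call_with_retry(make_request, max_attempts: int = 5):
    """Retry one benchmark question on transient API errors.

    `make_request` is any zero-argument callable that issues the API call.
    The exception type to catch depends on the client library, so this
    sketch catches broadly and re-raises on the final attempt.
    """
    for attempt in range(max_attempts):
        try:
            return make_request()
        except Exception:  # e.g. a rate-limit error from the provider SDK
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            time.sleep(2 ** attempt + random.random())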

What we measure ourselves against

Until the checker is fixed, the published research literature is the most honest reference for what multi-agent debate contributes:

Study                             | Finding
Du et al., ICML 2024              | Multi-agent debate adds +10–20% on math and strategic reasoning tasks vs. single-model baselines.
Chen et al., ACL 2024 (ReConcile) | +6.8% accuracy on reasoning benchmarks using heterogeneous models with confidence-weighted voting.
Khan et al., ICML 2024            | Debate between persuasive LLMs increases truthfulness even when none of the participants individually knows the answer.
Liang et al., 2023                | Multi-agent debate increases solution diversity on creative and divergent-thinking tasks.

These are the deltas Consilium's council mode is designed to capture. They're also the deltas we expect to see on our benchmark runs once the checker is doing semantic matching for free-text and unit-test execution for code.

Operational metrics that are real

Even with the checker noise, we did learn something concrete about runtime cost and convergence behavior:

  • Council deliberation cost ran ~$0.05–0.10 per question with 3 models × 3 rounds (the ~$17 of debate-side spend across 350 questions averages out to roughly $0.05).
  • Convergence detection shaves roughly 30–40% off API spend vs. running a fixed round count, by stopping early once the council's votes have stabilized (a sketch of the stopping rule follows this list).
  • Median latency ~45s in council mode, ~15s in quick mode.
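The core of the convergence idea is simple: record each model's current answer after every round and stop once nothing has changed for a couple of consecutive rounds. This is a sketch under an assumed data shape, not the production detector.

def should_stop(vote_history: list[dict[str, str]], stable_rounds: int = 2) -> bool:
    """Stop deliberation early once every model's vote has stopped changing.

    `vote_history` holds one dict per completed round, mapping model name to
    its current answer, e.g. [{"gpt": "B", "claude": "C", "gemini": "B"}, ...].
    """
    if len(vote_history) < stable_rounds:
        return False
    recent = vote_history[-stable_rounds:]
    # Stable means the last `stable_rounds` rounds are identical vote-for-vote.
    return all(round_votes == recent[0] for round_votes in recent)

With stable_rounds=2, a council that would otherwise run a fixed 3 rounds stops after round 2 whenever rounds 1 and 2 agree, which is where the 30–40% spend reduction comes from.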

When we'll publish real scores

The next benchmark run is gated on the checker fix: semantic match for TruthfulQA (LLM-as-judge with a held-out reference model), unit-test execution for HumanEval, and per-question rate-limit retry. When that lands we'll re-run the same 350-question set, publish the deltas, and update this post. Until then, treat the table at the top as evidence that we ran the experiment, not as evidence of what deliberation does.
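For the HumanEval piece specifically, the intended fix is the standard one: execute the tests that ship with each problem against the generated code. A sketch assuming the usual HumanEval problem fields (prompt, test, entry_point); this is not the runner's actual implementation, and real use needs a proper sandbox since it executes model-generated code.

import subprocess
import sys

def passes_unit_tests(problem: dict, completion: str, timeout: float = 10.0) -> bool:
    """Run a HumanEval-style problem's shipped tests against a model completion."""
    program = "\n".join([
        problem["prompt"] + completion,       # full candidate solution
        problem["test"],                      # the benchmark's own check() function
        f"check({problem['entry_point']})",   # invoke the tests
    ])
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0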

Run your own

cd apps/agents
python -m src.features.deliberation.benchmarks.runner \
  --benchmark mmlu_pro \
  --models claude-sonnet-4-6,gpt-5.4,gemini-3-flash-preview \
  --mode council --n 200 \
  --output results/mmlu_pro_council.json