Benchmark methodology (results coming soon)

How accurate is multi-AI debate compared to a single model?

Consilium is building a benchmark suite to measure multi-AI debate against single-model baselines on factual accuracy, calibration, and code quality. Benchmark results will be published here once the benchmark CLI ships. The suites below describe our planned methodology.

What did we benchmark?

The plan covers four public suites, each measuring a different capability: MMLU (factual recall across 57 subjects), TruthfulQA (resistance to plausible-sounding falsehoods), HumanEval (code synthesis correctness), and BBH-hard (multi-step reasoning). We are adding two derived metrics, open-domain hallucination rate and Expected Calibration Error (ECE), so we can see whether deliberation also changes how confident the system is when it is wrong.
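For readers unfamiliar with ECE, here is a minimal sketch of the standard binned definition: answers are grouped by stated confidence, and the gap between average confidence and observed accuracy in each bin is weighted by bin size. The function name, the 10 equal-width bins, and the input shape are assumptions for illustration, not the scoring code of the Consilium benchmark.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins over [0, 1]
) -> float:
    """Binned ECE: sum over bins of (bin share) * |avg confidence - accuracy|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one answer is required")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A lower value means stated confidence tracks actual accuracy more closely; a system that answers everything at 0.95 confidence but is right 75% of the time scores 0.20.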

Setup: 200 prompts per suite, with 3 runs averaged to reduce variance. Each prompt runs in two conditions: the strongest single model for that suite, and Consilium Council mode with three models (Claude Sonnet 4.6, GPT-5.4, Gemini 3 Flash) deliberating across 3 rounds with cross-examination, Condorcet voting, and convergence detection. Both conditions use temperature 0.7 and max_tokens 4,096, with no prompt engineering beyond a uniform system instruction, and both see exactly the same prompts.
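To make the Condorcet voting step concrete, the sketch below finds the candidate answer that wins every pairwise majority contest across the models' rankings. It illustrates the general Condorcet criterion only; the function name and ballot format are assumptions, and how Consilium breaks cycles or ties is not described here, so the sketch simply returns None in that case.

```python
from itertools import combinations
from typing import Optional, Sequence


def condorcet_winner(ballots: Sequence[Sequence[str]]) -> Optional[str]:
    """Return the candidate that beats every other one head-to-head, else None.

    Each ballot is a full ranking of the same candidates, best first.
    """
    if not ballots:
        raise ValueError("at least one ballot is required")
    candidates = sorted(set(ballots[0]))
    if any(set(ballot) != set(candidates) for ballot in ballots):
        raise ValueError("every ballot must rank the same candidates")

    pairwise_wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        a_preferred = sum(1 for ballot in ballots if ballot.index(a) < ballot.index(b))
        b_preferred = len(ballots) - a_preferred
        if a_preferred > b_preferred:
            pairwise_wins[a] += 1
        elif b_preferred > a_preferred:
            pairwise_wins[b] += 1

    for candidate, wins in pairwise_wins.items():
        if wins == len(candidates) - 1:
            return candidate
    return None  # cycle or tie: no Condorcet winner exists
```

With three voters, a winner exists whenever two of them agree on the top pairwise choice against every alternative; a rock-paper-scissors pattern of rankings produces no winner.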

What were the headline numbers?

Results will be published once the benchmark CLI ships. The table below shows the planned suites and will be populated with real data from reproducible runs.

| Benchmark | Best single model | Consilium Council | Delta |
| --- | --- | --- | --- |
| MMLU (factual recall) | Coming soon | Coming soon | --- |
| TruthfulQA (factual + calibration) | Coming soon | Coming soon | --- |
| HumanEval (code synthesis pass@1) | Coming soon | Coming soon | --- |
| BBH-hard (reasoning) | Coming soon | Coming soon | --- |
| Hallucination rate (open-domain) | Coming soon | Coming soon | --- |
| Calibration (ECE, lower is better) | Coming soon | Coming soon | --- |

Council = 3 models (Claude Sonnet 4.6, GPT-5.4, Gemini 3 Flash), 3 rounds. Results will be averaged across 3 runs once available.
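For the HumanEval row, pass@1 can be estimated from repeated samples with the unbiased pass@k estimator commonly used for that suite, 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that pass the tests. The sketch below assumes the 3 runs are treated as n = 3 samples per problem; whether the Consilium CLI scores it this way is an assumption.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def suite_pass_at_k(results: Sequence[Tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over problems; each entry is (n samples, c correct)."""
    if not results:
        raise ValueError("at least one problem is required")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n per problem, so a problem solved in 1 of 3 runs contributes one third.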

How do the modes compare on cost vs accuracy?

Mode-level cost and accuracy comparisons will be published once benchmark runs are complete. The table below shows the planned modes under evaluation.

| Mode | Avg latency | Avg cost / query | MMLU accuracy |
| --- | --- | --- | --- |
| Quick | Coming soon | Coming soon | Coming soon |
| Council (3 rounds) | Coming soon | Coming soon | Coming soon |
| Deep (5 rounds, sub-agents) | Coming soon | Coming soon | Coming soon |
| Red Team | Coming soon | Coming soon | Coming soon |
| Jury | Coming soon | Coming soon | Coming soon |
| Market | Coming soon | Coming soon | Coming soon |
| Blind | Coming soon | Coming soon | Coming soon |

Cost includes all model API calls plus aggregation overhead. No Consilium markup; BYOK rates only.

What does the cost vs quality curve look like?

A cost-vs-quality chart will be published once benchmark data is available. We expect Council mode to be the sweet spot for most use cases, with Quick mode offering the best value for low-stakes queries and Deep mode for high-stakes decisions.

Reference chart: /images/benchmarks/cost-vs-quality.png

What does the convergence detector tell us?

Consilium's convergence score is a weighted sum of three signals (Kendall tau with weight 0.4, Jaccard with weight 0.35, and concession with weight 0.25), compared against a 0.85 threshold. Convergence rate data across benchmark suites will be published once the benchmark CLI ships. When convergence is not reached, Consilium surfaces a dissent report instead of a synthesized answer.
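The weighting can be sketched as follows. Only the three weights and the 0.85 threshold come from the text above; everything else is an assumption for illustration: that Kendall tau is computed between two models' rankings and rescaled from [-1, 1] to [0, 1], that Jaccard compares sets of claims, that concession is already a rate in [0, 1], and that reaching the threshold exactly counts as converged.

```python
from itertools import combinations
from typing import Sequence, Set

# Weights and threshold as stated in the text.
W_KENDALL, W_JACCARD, W_CONCESSION = 0.40, 0.35, 0.25
THRESHOLD = 0.85


def kendall_tau(rank_a: Sequence[str], rank_b: Sequence[str]) -> float:
    """Kendall tau-a between two tie-free rankings of the same items, in [-1, 1]."""
    if set(rank_a) != set(rank_b) or len(rank_a) < 2:
        raise ValueError("rankings must cover the same two or more items")
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):  # x precedes y in rank_a
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two sets: |a & b| / |a | b|, defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def convergence_score(tau: float, jac: float, concession: float) -> float:
    """Weighted sum of the three signals, each brought into [0, 1]."""
    tau01 = (tau + 1.0) / 2.0  # assumption: rescale tau from [-1, 1] to [0, 1]
    return W_KENDALL * tau01 + W_JACCARD * jac + W_CONCESSION * concession


def has_converged(score: float) -> bool:
    # Assumption: meeting the threshold exactly counts as converged.
    return score >= THRESHOLD
```

Under these assumptions, identical rankings and full concession with only half the claims shared gives 0.4 + 0.175 + 0.25 = 0.825, which stays below the threshold and would trigger another round or a dissent report.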

When is single-model better than a council?

For sub-second decisions where latency matters more than ground-truth accuracy, and for queries where the strongest single model is already highly accurate (simple arithmetic, well-known facts, exact-match lookups), the latency and cost penalty of deliberation may not be worth the marginal accuracy gain. Consilium's Quick mode collapses to a single model for these cases, and Auto mode routes there automatically when the complexity classifier judges the prompt low-stakes.

How do I reproduce these numbers?

The benchmark CLI is under development. Once it ships, you will be able to install the CLI, export at least one provider key, and run the commands below to reproduce results.

MMLU council benchmark

consilium benchmark --suite mmlu --models claude-sonnet-4-6,gpt-5.4,gemini-3-flash --mode council --runs 3

TruthfulQA council benchmark

consilium benchmark --suite truthfulqa --models claude-sonnet-4-6,gpt-5.4,gemini-3-flash --mode council --runs 3

HumanEval council benchmark

consilium benchmark --suite humaneval --models claude-sonnet-4-6,gpt-5.4,gemini-3-flash --mode council --runs 3

Frequently asked questions

Did you tune the models for these benchmarks?

No. Every run uses default settings with temperature 0.7 and max_tokens 4,096, and the single-model and council conditions receive the same prompts. There is no prompt engineering beyond a uniform system instruction, no chain-of-thought scaffolding beyond what each model produces unaided, and no model-specific overrides.

How was hallucination rate measured?

Following the TruthfulQA scoring methodology, an answer counts as hallucinated when it asserts a non-factual claim with confidence > 0.5. Open-domain hallucination is scored across a 200-prompt mix of factual recall and adversarial setups, then averaged across three runs per condition.
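The rule above can be sketched in a few lines. The ScoredAnswer record and function names are hypothetical, and how factuality and confidence are judged for each answer is outside the sketch; it only shows the thresholding and the averaging across runs.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class ScoredAnswer:
    """Hypothetical record for one graded answer."""
    is_factual: bool
    confidence: float  # stated or estimated confidence in [0, 1]


def is_hallucination(answer: ScoredAnswer, threshold: float = 0.5) -> bool:
    # Hallucinated: a non-factual claim asserted with confidence above 0.5.
    return (not answer.is_factual) and answer.confidence > threshold


def hallucination_rate(answers: Sequence[ScoredAnswer]) -> float:
    """Share of answers in one run that count as hallucinated."""
    if not answers:
        raise ValueError("at least one answer is required")
    return sum(1 for a in answers if is_hallucination(a)) / len(answers)


def mean_rate_across_runs(runs: Sequence[Sequence[ScoredAnswer]]) -> float:
    """Average the per-run rate, e.g. over the three runs per condition."""
    if not runs:
        raise ValueError("at least one run is required")
    return mean(hallucination_rate(run) for run in runs)
```

Note that a wrong answer given with low confidence does not count under this rule, which is why the metric is reported alongside calibration.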

Can I reproduce these numbers?

The benchmark CLI is under development. Once it ships, you will be able to run consilium benchmark with the same suite and models to reproduce results.

Why would Consilium be more accurate than the best single model?

Results are not published yet, so this is the hypothesis the benchmark is designed to test. Cross-examination should surface errors and gaps that an individual model misses. The convergence score only crosses 0.85 when models genuinely agree, so disagreement triggers another round. We expect confident-but-wrong answers to be caught before synthesis, and we expect that effect to show up most clearly on TruthfulQA.

When is single-model better than Consilium?

For sub-second decisions (latency-sensitive UX) and for queries where the strongest single model is already 99%+ accurate (simple arithmetic, well-known facts), the latency and cost penalty of deliberation is not worth the marginal accuracy gain. Quick mode collapses to a single model for these and is the right default for low-stakes calls.

License: These numbers are published under CC BY 4.0. You may quote, embed, or rebuild on them with attribution to Consilium.