Research
The peer-reviewed science behind multi-agent deliberation. Every Consilium feature maps to a specific finding from research published at ICML, ACL, and AAAI, or from the AI safety community.
Du et al. (ICML 2024): Improving Factuality and Reasoning through Multi-Agent Debate
Abstract
This paper demonstrates that having multiple LLM instances propose answers, debate their reasoning, and iteratively revise their responses leads to significant improvements in factual accuracy and mathematical reasoning. The debate mechanism encourages models to identify and correct errors in each other's reasoning chains, producing more reliable outputs than any single model run. The authors show that the improvement scales with both the number of agents and the number of debate rounds, with diminishing returns after 3-4 rounds.
Key Findings
- Multi-agent debate improves factual accuracy by 8-15% across benchmarks
- GSM8K math reasoning improved from 82% to 91% with 3-agent debate
- MMLU scores improved 8-12% compared to single-model baselines
- Debate is most effective on questions requiring multi-step reasoning
- Improvement scales with agent count and rounds, diminishing after 3-4 rounds
Methodology
Multiple LLM instances independently propose answers, then engage in structured debate rounds where they critique and revise each other's responses. Convergence is measured by answer stability across rounds. Experiments run across GSM8K, MMLU, and TruthfulQA benchmarks with varying agent counts (2-6) and round counts (1-6).
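A minimal sketch of this propose-critique-revise loop in Python (the `ask` helper, prompt wording, and stopping rule are illustrative assumptions, not the paper's exact protocol):

```python
def ask(model: str, prompt: str) -> str:
    """Hypothetical LLM-call helper; wire this to your provider's API."""
    raise NotImplementedError

def multi_agent_debate(models: list[str], question: str, rounds: int = 3) -> list[str]:
    # Round 0: each agent answers independently.
    answers = [ask(m, question) for m in models]
    for _ in range(rounds):
        revised = []
        for i, model in enumerate(models):
            peers = "\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (
                f"Question: {question}\n"
                f"Other agents answered:\n{peers}\n"
                f"Your previous answer: {answers[i]}\n"
                "Critique the other answers and give your revised answer."
            )
            revised.append(ask(model, prompt))
        if revised == answers:  # answers stable across a full round: converged
            break
        answers = revised
    return answers
```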
Benchmark Results
GSM8K: 82% → 91% (3 agents, 3 rounds). MMLU: +8-12% over single model. TruthfulQA: +14% on adversarial questions.
Consilium Implementation
Council and Deep modes implement this paper's debate protocol directly. In Council mode, 3+ models deliberate across multiple rounds with cross-examination. Deep mode extends this with sub-agent research for complex questions requiring extended reasoning chains. Consilium's convergence detection (Kendall tau + Jaccard + concession tracking) formalizes the paper's answer stability measurement into a mathematical threshold (>= 0.85).
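A sketch of how these three signals might combine, using the weights from the feature table at the end of this page (0.4 / 0.35 / 0.25); the tau normalization and input types are assumptions:

```python
from scipy.stats import kendalltau

def convergence_score(prev_ranking: list[int], curr_ranking: list[int],
                      prev_claims: set[str], curr_claims: set[str],
                      concession_rate: float) -> float:
    # Rank stability between consecutive rounds, mapped from [-1, 1] to [0, 1].
    tau, _ = kendalltau(prev_ranking, curr_ranking)
    rank_stability = (tau + 1) / 2
    # Jaccard overlap of the claims surfaced in each round.
    claim_overlap = len(prev_claims & curr_claims) / len(prev_claims | curr_claims)
    # Weighted blend; deliberation stops once the score reaches 0.85.
    return (0.4 * rank_stability
            + 0.35 * claim_overlap
            + 0.25 * (1 - concession_rate))
```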
Khan et al. (ICML 2024 Best Paper): Debating with More Persuasive LLMs Leads to More Truthful Answers
Abstract
This ICML 2024 Best Paper award winner investigates what happens when debaters have asymmetric capabilities — when one model is more persuasive than another. The key finding is that even when one debater is more persuasive, structured debate protocols still converge on truthful answers, because truth has a natural advantage in debate. Truthful arguments are easier to defend under repeated scrutiny, while false arguments require increasingly elaborate justifications that eventually collapse under adversarial pressure.
Key Findings
- Truth has a natural advantage in structured debate — truthful positions are easier to defend
- Even asymmetric debates (strong vs. weak model) converge on correct answers
- Structured protocols prevent persuasive but incorrect arguments from dominating
- Validates debate as a scalable oversight method for AI alignment
- Judges improve accuracy when evaluating debate transcripts vs. direct answers
Methodology
Asymmetric debate experiments where models of varying capability argue for correct and incorrect positions. Human and AI judges evaluate debate transcripts without knowing which model argued which side. Experiments measure judge accuracy across multiple debate formats: single-turn, multi-turn, and cross-examination. The study controls for model capability by pairing GPT-4 against Claude and measuring convergence rates.
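A sketch of identity-blind transcript judging with multiple argument orderings, in the spirit of this methodology (the `judge` callable and the win-counting rule are assumptions):

```python
import random

def blind_judge(arguments: dict[str, str], judge, orderings: int = 3) -> str:
    """Pick a winner without revealing which model wrote which argument.

    arguments -- maps model id -> argument text (ids are never shown)
    judge     -- callable taking a list of anonymous texts and returning
                 the index of the most convincing one
    """
    wins = {model: 0 for model in arguments}
    items = list(arguments.items())
    for _ in range(orderings):
        random.shuffle(items)          # vary presentation order each pass
        winner_idx = judge([text for _, text in items])
        wins[items[winner_idx][0]] += 1
    return max(wins, key=wins.get)     # most wins across orderings
```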
Benchmark Results
Judge accuracy: 76% (direct) → 88% (after debate). Asymmetric pairing: truth-side wins 84% of debates regardless of model strength.
Consilium Implementation
Blind mode implements this paper's insight by hiding model identities during evaluation, preventing brand bias. The judge evaluates arguments purely on merit using multiple argument orderings. This ensures a more persuasive model cannot win through reputation alone — only through the strength of its evidence. Consilium's confidence calibration formula (stability * (1 - concession_rate) * (1 - 0.3 * qualification_rate)) directly operationalizes the paper's finding that explanation stability predicts truthfulness.
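The calibration formula itself is simple enough to state directly; the parameter interpretations in this sketch are assumptions:

```python
def calibrated_confidence(stability: float, concession_rate: float,
                          qualification_rate: float) -> float:
    """Consilium's confidence calibration formula as quoted above.

    stability          -- answer stability across rounds, in [0, 1]
    concession_rate    -- fraction of claims the model walked back
    qualification_rate -- fraction of statements hedged with qualifiers
    """
    return stability * (1 - concession_rate) * (1 - 0.3 * qualification_rate)

# A stable answer (0.9) with few concessions (0.1) and some hedging (0.2):
# 0.9 * 0.9 * 0.94 = 0.761
```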
Chen et al. (ACL 2024): ReConcile, Round-Table Consensus among Diverse LLMs
Abstract
ReConcile proposes a round-table discussion framework where diverse LLMs engage in multi-round discussions, share confidence scores, and update their positions based on group deliberation. The paper demonstrates that this approach consistently outperforms both the best individual model and simple ensemble methods like majority voting. The key insight is that confidence-weighted consensus captures more information than simple aggregation — models that are uncertain about their answers appropriately defer to more confident peers.
Key Findings
- 3-10% improvement over the best individual model across reasoning benchmarks
- Confidence-weighted voting outperforms simple majority voting by 5-7%
- Diverse model ensembles (different architectures) perform better than same-model ensembles
- Round-table format enables models to learn from each other's reasoning strategies
- Optimal performance at 3-5 models; beyond 5, diminishing returns
Methodology
Round-table conference format where diverse LLMs discuss problems across multiple rounds, sharing confidence-weighted votes. Models update their positions based on the group's reasoning, with final answers determined by confidence-weighted consensus. Experiments compare same-architecture vs. cross-architecture ensembles across StrategyQA, ARC, and MATH benchmarks.
Benchmark Results
StrategyQA: +7% over best single model. ARC-Challenge: +5%. MATH: +10% on hardest problems. Cross-architecture ensembles: +3% over same-architecture.
Consilium Implementation
Council mode implements the round-table format with Condorcet and Borda count voting systems. Consilium extends the paper's approach with confidence-weighted ballots, Ranked Pairs tiebreaking, and Copeland scoring for comparative analysis — applying formal social choice theory to the consensus mechanism. The paper's finding that diverse architectures outperform same-model ensembles is why Consilium supports 5 providers (Anthropic, OpenAI, Google, xAI, Groq) for cross-architecture deliberation.
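A sketch of one piece of this machinery, a confidence-weighted Borda tally (the ballot format is an assumption; Consilium's full pipeline adds Ranked Pairs tiebreaking and Copeland scoring):

```python
from collections import defaultdict

def weighted_borda(ballots: list[tuple[list[str], float]]) -> str:
    """Tally ranked ballots, scaling Borda points by voter confidence.

    Each ballot is (ranking, confidence): candidate answers ordered
    best-first, confidence in [0, 1].
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking, confidence in ballots:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            # Standard Borda points (n - 1 for first place), confidence-scaled.
            scores[candidate] += confidence * (n - 1 - position)
    return max(scores, key=scores.get)

# Three models rank answers A, B, C with differing confidence:
winner = weighted_borda([
    (["A", "B", "C"], 0.9),
    (["B", "A", "C"], 0.6),
    (["A", "C", "B"], 0.8),
])  # -> "A" (score 4.0 vs. 2.1 for B, 0.8 for C)
```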
Irving et al. (2018): AI Safety via Debate
Abstract
This foundational paper proposes debate as an alignment technique where two AI systems argue for opposing positions while a human (or AI) judge evaluates. The key insight is that debate enables judges to assess the quality of AI outputs even on tasks they cannot solve directly — the adversarial structure forces both sides to surface the strongest evidence, making evaluation tractable. The paper provides theoretical analysis showing that optimal play in debate converges on truthful answers under reasonable assumptions about the judge's ability to verify evidence.
Key Findings
- Debate enables evaluation of AI outputs on tasks beyond the judge's direct capability
- Adversarial structure incentivizes surfacing the strongest evidence for each position
- Debate scales better than direct human oversight for complex tasks
- Optimal play in debate converges on truth under reasonable verification assumptions
- The approach provides a natural mechanism for identifying and preserving minority opinions
Methodology
Two AI systems debate opposing positions on a given question. A judge (human or AI) evaluates the debate transcript and selects the winning position. The adversarial incentive structure ensures both sides present their strongest arguments. Theoretical analysis proves convergence properties under various judge capability assumptions.
Benchmark Results
Theoretical: optimal debate converges to truth with O(log n) judge queries for n-bit answers. Empirical validation in subsequent papers (Khan et al., Du et al.).
Consilium Implementation
Red Team mode implements the attack/defend/judge framework directly. Models take adversarial positions, challenge each other with typed attacks (FACTUAL_ERROR, MISSING_EVIDENCE, FLAWED_LOGIC), and a judge synthesizes the final assessment. Jury mode extends this with mandatory dissent — ensuring minority opinions are preserved even when the majority reaches consensus. The paper's theoretical convergence guarantees motivate Consilium's mathematical convergence threshold (0.85).
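A sketch of how the typed attacks might be represented (field names are illustrative; only the three attack types come from the source):

```python
from dataclasses import dataclass
from enum import Enum, auto

class AttackType(Enum):
    FACTUAL_ERROR = auto()      # the claim contradicts known facts
    MISSING_EVIDENCE = auto()   # the claim lacks support
    FLAWED_LOGIC = auto()       # the inference does not follow

@dataclass
class Attack:
    kind: AttackType
    target_claim: str   # the defender's claim under challenge
    rationale: str      # why the attacker believes the claim fails

@dataclass
class Defense:
    attack: Attack
    rebuttal: str
    conceded: bool      # concessions feed convergence and confidence tracking
```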
Li et al. (AAAI 2024): Creative Output through Multi-Model Discussion
Abstract
This paper explores how structured discussion between LLMs produces more creative and diverse outputs than individual generation. By having models propose ideas, critique each other's proposals, and build on promising directions collaboratively, the discussion framework overcomes the tendency of individual models to produce safe, predictable outputs. The authors demonstrate that role assignment — giving models specific personas during discussion — further improves creative diversity by forcing exploration of perspectives that a single model would not naturally adopt.
Key Findings
- Structured multi-model discussion produces 23% more creative outputs (human evaluation)
- Discussion format encourages exploration of unconventional approaches
- Role-play assignment increases creative diversity by forcing perspective shifts
- Collaborative refinement improves both novelty and quality simultaneously
- Models build on each other's ideas in ways single models cannot self-generate
Methodology
Multiple LLMs engage in structured discussion rounds: initial ideation, critique and exploration, collaborative refinement. Models are assigned distinct roles (e.g., 'optimist', 'skeptic', 'domain expert') to force perspective diversity. Creativity metrics (novelty, diversity, quality, usefulness) are evaluated by both human judges and automated metrics across story generation, product ideation, and problem-solving tasks.
Benchmark Results
Story generation novelty: +23% (human eval). Product ideation: +31% unique ideas. Problem solving: +18% solution diversity. Role-play vs. no-role: +12% creative diversity.
Consilium Implementation
Market mode's probability aggregation mechanism encourages creative divergence before convergence. Models stake credibility on positions, which incentivizes novel perspectives that can differentiate from the consensus. The prediction market structure rewards models that identify valuable unconventional insights early. The paper's role-play finding informs Consilium's Red Team role assignment (attacker, defender, judge) and the dialectical structure of Blind mode (risk advocate vs. acceptability advocate).
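The page does not specify the aggregation rule, so this is only one plausible reading: stake-weighted pooling in log-odds space, which keeps a confident, well-staked minority view influential:

```python
import math

def aggregate_market(positions: list[tuple[float, float]]) -> float:
    """Stake-weighted log-odds pooling of model forecasts.

    positions -- (probability, stake) pairs, one per model,
    with each probability strictly between 0 and 1.
    """
    total_stake = sum(stake for _, stake in positions)
    pooled_logit = sum(
        (stake / total_stake) * math.log(p / (1 - p))
        for p, stake in positions
    )
    return 1 / (1 + math.exp(-pooled_logit))

# Two agreeing models vs. one dissenter with a larger stake:
print(aggregate_market([(0.8, 1.0), (0.7, 1.0), (0.2, 2.0)]))  # ~0.47
```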
Irving et al.: Doubly-Efficient Debate
Abstract
This paper extends the debate framework to address computational efficiency, demonstrating that debate can be made practically efficient while maintaining safety guarantees. The 'doubly-efficient' property ensures that both the debaters and the judge can operate within reasonable computational budgets, making debate-based oversight viable for production systems. The authors propose complexity-based routing where simple questions skip full debate and only complex, high-stakes questions receive the full multi-round treatment.
Key Findings
- Debate protocols can be optimized for cost without sacrificing safety guarantees
- Complexity-based routing reduces cost by 60-80% on simple questions
- Efficient debate maintains the quality benefits of full debate at lower cost
- Practical implementations can route questions to appropriate debate depth automatically
- The doubly-efficient property makes debate viable for production-scale systems
Methodology
Analysis of debate protocols with varying computational budgets, measuring the tradeoff between deliberation depth and output quality. Proposes routing mechanisms that allocate debate resources based on question complexity. Experiments measure quality degradation curves as debate rounds are reduced, identifying optimal cost/quality tradeoffs for different question types.
Benchmark Results
Simple questions: single-round achieves 95% of full-debate quality at 20% cost. Complex questions: 3 rounds achieve 98% quality. Routing accuracy: 89% correct complexity classification.
Consilium Implementation
Auto mode implements complexity-based routing that analyzes question difficulty and automatically selects the appropriate deliberation mode. Simple factual questions route to Quick mode (single round), while complex multi-stakeholder decisions route to Deep or Red Team modes. This optimizes cost without sacrificing quality where it matters. Consilium's template system (code_review, research_synthesis, risk_assessment, healthcare, legal, finance) extends this by pre-configuring the optimal debate depth for each domain.
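A sketch of what that routing could look like (the thresholds and the classifier feeding `complexity_score` are assumptions; the mode names are Consilium's):

```python
def route(complexity_score: float) -> str:
    """Map a question's estimated complexity in [0, 1] to a deliberation mode."""
    if complexity_score < 0.3:
        return "quick"      # simple factual question: single round
    if complexity_score < 0.7:
        return "council"    # multi-round deliberation with cross-examination
    return "deep"           # sub-agent research, or "red_team" for high stakes
```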
How Consilium Implements the Research
Every feature maps to a specific peer-reviewed finding.
| Research Finding | Paper | Consilium Features |
|---|---|---|
| Multi-agent debate improves factuality by 8-15% | Du et al. (ICML 2024) | Council mode, Deep mode, multi-round deliberation, cross-examination |
| Truth wins in structured debate even with asymmetric models | Khan et al. (ICML 2024 Best Paper) | Blind mode, identity-hidden judge evaluation, multiple argument orderings |
| Confidence-weighted consensus outperforms majority voting by 5-7% | Chen et al. (ACL 2024) | Condorcet voting, Borda count, confidence-weighted ballots, Ranked Pairs |
| Adversarial debate enables scalable oversight beyond judge capability | Irving et al. (Alignment Forum) | Red Team mode, typed attack/defend phases, mandatory dissent, judge synthesis |
| Multi-model discussion produces 23% more creative outputs | Li et al. (AAAI 2024) | Market mode, probability aggregation, role assignment, creative divergence |
| Complexity routing reduces debate cost by 60-80% on simple questions | Irving et al. (AI Safety) | Auto mode, complexity routing, template pre-configuration, cost optimization |
| Diverse model architectures outperform same-model ensembles by 3% | Chen et al. (ACL 2024) | 5 LLM providers, 15 models, cross-architecture debate |
| Mathematical convergence detection improves reliability | Du et al. (ICML 2024) | Kendall tau (0.4), Jaccard index (0.35), concession tracking (0.25); threshold 0.85 |
| Explanation stability predicts answer truthfulness | Khan et al. (ICML 2024 Best Paper) | Confidence calibration, concession-rate tracking, qualification penalty |
| Role assignment increases creative diversity by 12% | Li et al. (AAAI 2024) | Red Team roles (attacker/defender/judge), Blind dialectical structure, persona-driven deliberation |