Research
The peer-reviewed science behind multi-agent deliberation. Every Consilium feature maps to a specific finding from research published at ICML, ACL, and AAAI, or from the AI safety community.
Du et al. (ICML 2024): Improving Factuality and Reasoning through Multi-Agent Debate
Abstract
This paper demonstrates that having multiple LLM instances propose answers, debate their reasoning, and iteratively revise their responses leads to significant improvements in factual accuracy and mathematical reasoning. The debate mechanism encourages models to identify and correct errors in each other's reasoning chains, producing more reliable outputs than any single model run. The authors show that the improvement scales with both the number of agents and the number of debate rounds, with diminishing returns after 3-4 rounds.
Key Findings
- Multi-agent debate improves factual accuracy by 8-15% across benchmarks
- GSM8K math reasoning improved from 82% to 91% with 3-agent debate
- MMLU scores improved 8-12% compared to single-model baselines
- Debate is most effective on questions requiring multi-step reasoning
- Improvement scales with agent count and rounds, diminishing after 3-4 rounds
Methodology
Multiple LLM instances independently propose answers, then engage in structured debate rounds where they critique and revise each other's responses. Convergence is measured by answer stability across rounds. Experiments run across GSM8K, MMLU, and TruthfulQA benchmarks with varying agent counts (2-6) and round counts (1-6).
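A minimal sketch of this propose-critique-revise loop in Python (the `ask` helper, prompt wording, and stopping rule are illustrative assumptions, not the paper's exact protocol):

```python
def ask(model: str, prompt: str) -> str:
    """Hypothetical LLM-call helper; wire this to your provider's API."""
    raise NotImplementedError

def multi_agent_debate(models: list[str], question: str, rounds: int = 3) -> list[str]:
    # Round 0: each agent answers independently.
    answers = [ask(m, question) for m in models]
    for _ in range(rounds):
        revised = []
        for i, model in enumerate(models):
            peers = "\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (
                f"Question: {question}\n"
                f"Other agents answered:\n{peers}\n"
                f"Your previous answer: {answers[i]}\n"
                "Critique the other answers and give your revised answer."
            )
            revised.append(ask(model, prompt))
        if revised == answers:  # answers stable across a full round: converged
            break
        answers = revised
    return answers
```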
Benchmark Results
GSM8K: 82% → 91% (3 agents, 3 rounds). MMLU: +8-12% over single model. TruthfulQA: +14% on adversarial questions.
Consilium Implementation
Council and Deep modes implement this paper's debate protocol directly. In Council mode, 3+ models deliberate across multiple rounds with cross-examination. Deep mode extends this with sub-agent research for complex questions requiring extended reasoning chains. Consilium's convergence detection (Kendall tau + Jaccard + concession tracking) formalizes the paper's answer stability measurement into a mathematical threshold (>= 0.85).
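A sketch of how these three signals might combine, using the weights from the feature table at the end of this page (0.4 / 0.35 / 0.25); the tau normalization and input types are assumptions:

```python
from scipy.stats import kendalltau

def convergence_score(prev_ranking: list[int], curr_ranking: list[int],
                      prev_claims: set[str], curr_claims: set[str],
                      concession_rate: float) -> float:
    # Rank stability between consecutive rounds, mapped from [-1, 1] to [0, 1].
    tau, _ = kendalltau(prev_ranking, curr_ranking)
    rank_stability = (tau + 1) / 2
    # Jaccard overlap of the claims surfaced in each round.
    claim_overlap = len(prev_claims & curr_claims) / len(prev_claims | curr_claims)
    # Weighted blend; deliberation stops once the score reaches 0.85.
    return (0.4 * rank_stability
            + 0.35 * claim_overlap
            + 0.25 * (1 - concession_rate))
```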
Khan et al. (ICML 2024 Best Paper): Debating with More Persuasive LLMs Leads to More Truthful Answers
Abstract
This ICML 2024 Best Paper award winner investigates what happens when debaters have asymmetric capabilities — when one model is more persuasive than another. The key finding is that even when one debater is more persuasive, structured debate protocols still converge on truthful answers, because truth has a natural advantage in debate. Truthful arguments are easier to defend under repeated scrutiny, while false arguments require increasingly elaborate justifications that eventually collapse under adversarial pressure.
Key Findings
- Truth has a natural advantage in structured debate — truthful positions are easier to defend
- Even asymmetric debates (strong vs. weak model) converge on correct answers
- Structured protocols prevent persuasive but incorrect arguments from dominating
- Validates debate as a scalable oversight method for AI alignment
- Judges improve accuracy when evaluating debate transcripts vs. direct answers
Methodology
Asymmetric debate experiments where models of varying capability argue for correct and incorrect positions. Human and AI judges evaluate debate transcripts without knowing which model argued which side. Experiments measure judge accuracy across multiple debate formats: single-turn, multi-turn, and cross-examination. The study controls for model capability by pairing GPT-4 against Claude and measuring convergence rates.
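A sketch of identity-blind transcript judging with multiple argument orderings, in the spirit of this methodology (the `judge` callable and the win-counting rule are assumptions):

```python
import random

def blind_judge(arguments: dict[str, str], judge, orderings: int = 3) -> str:
    """Pick a winner without revealing which model wrote which argument.

    arguments -- maps model id -> argument text (ids are never shown)
    judge     -- callable taking a list of anonymous texts and returning
                 the index of the most convincing one
    """
    wins = {model: 0 for model in arguments}
    items = list(arguments.items())
    for _ in range(orderings):
        random.shuffle(items)          # vary presentation order each pass
        winner_idx = judge([text for _, text in items])
        wins[items[winner_idx][0]] += 1
    return max(wins, key=wins.get)     # most wins across orderings
```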
Benchmark Results
Judge accuracy: 76% (direct) → 88% (after debate). Asymmetric pairing: truth-side wins 84% of debates regardless of model strength.
Consilium Implementation
Blind mode implements this paper's insight by hiding model identities during evaluation, preventing brand bias. The judge evaluates arguments purely on merit using multiple argument orderings. This ensures a more persuasive model cannot win through reputation alone — only through the strength of its evidence. Consilium's confidence calibration formula (stability * (1 - concession_rate) * (1 - 0.3 * qualification_rate)) directly operationalizes the paper's finding that explanation stability predicts truthfulness.
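The calibration formula itself is simple enough to state directly; the parameter interpretations in this sketch are assumptions:

```python
def calibrated_confidence(stability: float, concession_rate: float,
                          qualification_rate: float) -> float:
    """Consilium's confidence calibration formula as quoted above.

    stability          -- answer stability across rounds, in [0, 1]
    concession_rate    -- fraction of claims the model walked back
    qualification_rate -- fraction of statements hedged with qualifiers
    """
    return stability * (1 - concession_rate) * (1 - 0.3 * qualification_rate)

# A stable answer (0.9) with few concessions (0.1) and some hedging (0.2):
# 0.9 * 0.9 * 0.94 = 0.761
```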
Chen et al. (ACL 2024): ReConcile, Round-Table Consensus among Diverse LLMs
Abstract
ReConcile proposes a round-table discussion framework where diverse LLMs engage in multi-round discussions, share confidence scores, and update their positions based on group deliberation. The paper demonstrates that this approach consistently outperforms both the best individual model and simple ensemble methods like majority voting. The key insight is that confidence-weighted consensus captures more information than simple aggregation — models that are uncertain about their answers appropriately defer to more confident peers.
Key Findings
- 3-10% improvement over the best individual model across reasoning benchmarks
- Confidence-weighted voting outperforms simple majority voting by 5-7%
- Diverse model ensembles (different architectures) perform better than same-model ensembles
- Round-table format enables models to learn from each other's reasoning strategies
- Optimal performance at 3-5 models; beyond 5, diminishing returns
Methodology
Round-table conference format where diverse LLMs discuss problems across multiple rounds, sharing confidence-weighted votes. Models update their positions based on the group's reasoning, with final answers determined by confidence-weighted consensus. Experiments compare same-architecture vs. cross-architecture ensembles across StrategyQA, ARC, and MATH benchmarks.
Benchmark Results
StrategyQA: +7% over best single model. ARC-Challenge: +5%. MATH: +10% on hardest problems. Cross-architecture ensembles: +3% over same-architecture.
Consilium Implementation
Council mode implements the round-table format with Condorcet and Borda count voting systems. Consilium extends the paper's approach with confidence-weighted ballots, Ranked Pairs tiebreaking, and Copeland scoring for comparative analysis — applying formal social choice theory to the consensus mechanism. The paper's finding that diverse architectures outperform same-model ensembles is why Consilium supports 5 providers (Anthropic, OpenAI, Google, xAI, Groq) for cross-architecture deliberation.
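A sketch of one piece of this machinery, a confidence-weighted Borda tally (the ballot format is an assumption; Consilium's full pipeline adds Ranked Pairs tiebreaking and Copeland scoring):

```python
from collections import defaultdict

def weighted_borda(ballots: list[tuple[list[str], float]]) -> str:
    """Tally ranked ballots, scaling Borda points by voter confidence.

    Each ballot is (ranking, confidence): candidate answers ordered
    best-first, confidence in [0, 1].
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking, confidence in ballots:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            # Standard Borda points (n - 1 for first place), confidence-scaled.
            scores[candidate] += confidence * (n - 1 - position)
    return max(scores, key=scores.get)

# Three models rank answers A, B, C with differing confidence:
winner = weighted_borda([
    (["A", "B", "C"], 0.9),
    (["B", "A", "C"], 0.6),
    (["A", "C", "B"], 0.8),
])  # -> "A" (score 4.0 vs. 2.1 for B, 0.8 for C)
```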
Irving et al. (2018): AI Safety via Debate
Abstract
This foundational paper proposes debate as an alignment technique where two AI systems argue for opposing positions while a human (or AI) judge evaluates. The key insight is that debate enables judges to assess the quality of AI outputs even on tasks they cannot solve directly — the adversarial structure forces both sides to surface the strongest evidence, making evaluation tractable. The paper provides theoretical analysis showing that optimal play in debate converges on truthful answers under reasonable assumptions about the judge's ability to verify evidence.
Key Findings
- Debate enables evaluation of AI outputs on tasks beyond the judge's direct capability
- Adversarial structure incentivizes surfacing the strongest evidence for each position
- Debate scales better than direct human oversight for complex tasks
- Optimal play in debate converges on truth under reasonable verification assumptions
- The approach provides a natural mechanism for identifying and preserving minority opinions
Methodology
Two AI systems debate opposing positions on a given question. A judge (human or AI) evaluates the debate transcript and selects the winning position. The adversarial incentive structure ensures both sides present their strongest arguments. Theoretical analysis proves convergence properties under various judge capability assumptions.
Benchmark Results
Theoretical: optimal debate converges to truth with O(log n) judge queries for n-bit answers. Empirical validation in subsequent papers (Khan et al., Du et al.).
Consilium Implementation
Red Team mode implements the attack/defend/judge framework directly. Models take adversarial positions, challenge each other with typed attacks (FACTUAL_ERROR, MISSING_EVIDENCE, FLAWED_LOGIC), and a judge synthesizes the final assessment. Jury mode extends this with mandatory dissent — ensuring minority opinions are preserved even when the majority reaches consensus. The paper's theoretical convergence guarantees motivate Consilium's mathematical convergence threshold (0.85).
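A sketch of how the typed attacks might be represented (field names are illustrative; only the three attack types come from the source):

```python
from dataclasses import dataclass
from enum import Enum, auto

class AttackType(Enum):
    FACTUAL_ERROR = auto()      # the claim contradicts known facts
    MISSING_EVIDENCE = auto()   # the claim lacks support
    FLAWED_LOGIC = auto()       # the inference does not follow

@dataclass
class Attack:
    kind: AttackType
    target_claim: str   # the defender's claim under challenge
    rationale: str      # why the attacker believes the claim fails

@dataclass
class Defense:
    attack: Attack
    rebuttal: str
    conceded: bool      # concessions feed convergence and confidence tracking
```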
Li et al. (AAAI 2024): Creative Output through Multi-Model Discussion
Abstract
This paper explores how structured discussion between LLMs produces more creative and diverse outputs than individual generation. By having models propose ideas, critique each other's proposals, and build on promising directions collaboratively, the discussion framework overcomes the tendency of individual models to produce safe, predictable outputs. The authors demonstrate that role assignment — giving models specific personas during discussion — further improves creative diversity by forcing exploration of perspectives that a single model would not naturally adopt.
Key Findings
- Structured multi-model discussion produces 23% more creative outputs (human evaluation)
- Discussion format encourages exploration of unconventional approaches
- Role-play assignment increases creative diversity by forcing perspective shifts
- Collaborative refinement improves both novelty and quality simultaneously
- Models build on each other's ideas in ways single models cannot self-generate
Methodology
Multiple LLMs engage in structured discussion rounds: initial ideation, critique and exploration, collaborative refinement. Models are assigned distinct roles (e.g., 'optimist', 'skeptic', 'domain expert') to force perspective diversity. Creativity metrics (novelty, diversity, quality, usefulness) are evaluated by both human judges and automated metrics across story generation, product ideation, and problem-solving tasks.
Benchmark Results
Story generation novelty: +23% (human eval). Product ideation: +31% unique ideas. Problem solving: +18% solution diversity. Role-play vs. no-role: +12% creative diversity.
Consilium Implementation
Market mode's probability aggregation mechanism encourages creative divergence before convergence. Models stake credibility on positions, which incentivizes novel perspectives that can differentiate from the consensus. The prediction market structure rewards models that identify valuable unconventional insights early. The paper's role-play finding informs Consilium's Red Team role assignment (attacker, defender, judge) and the dialectical structure of Blind mode (risk advocate vs. acceptability advocate).
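The page does not specify the aggregation rule, so this is only one plausible reading: stake-weighted pooling in log-odds space, which keeps a confident, well-staked minority view influential:

```python
import math

def aggregate_market(positions: list[tuple[float, float]]) -> float:
    """Stake-weighted log-odds pooling of model forecasts.

    positions -- (probability, stake) pairs, one per model,
    with each probability strictly between 0 and 1.
    """
    total_stake = sum(stake for _, stake in positions)
    pooled_logit = sum(
        (stake / total_stake) * math.log(p / (1 - p))
        for p, stake in positions
    )
    return 1 / (1 + math.exp(-pooled_logit))

# Two agreeing models vs. one dissenter with a larger stake:
print(aggregate_market([(0.8, 1.0), (0.7, 1.0), (0.2, 2.0)]))  # ~0.47
```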
Irving et al.: Doubly-Efficient Debate
Abstract
This paper extends the debate framework to address computational efficiency, demonstrating that debate can be made practically efficient while maintaining safety guarantees. The 'doubly-efficient' property ensures that both the debaters and the judge can operate within reasonable computational budgets, making debate-based oversight viable for production systems. The authors propose complexity-based routing where simple questions skip full debate and only complex, high-stakes questions receive the full multi-round treatment.
Key Findings
- Debate protocols can be optimized for cost without sacrificing safety guarantees
- Complexity-based routing reduces cost by 60-80% on simple questions
- Efficient debate maintains the quality benefits of full debate at lower cost
- Practical implementations can route questions to appropriate debate depth automatically
- The doubly-efficient property makes debate viable for production-scale systems
Methodology
Analysis of debate protocols with varying computational budgets, measuring the tradeoff between deliberation depth and output quality. Proposes routing mechanisms that allocate debate resources based on question complexity. Experiments measure quality degradation curves as debate rounds are reduced, identifying optimal cost/quality tradeoffs for different question types.
Benchmark Results
Simple questions: single-round achieves 95% of full-debate quality at 20% cost. Complex questions: 3 rounds achieve 98% quality. Routing accuracy: 89% correct complexity classification.
Consilium Implementation
Auto mode implements complexity-based routing that analyzes question difficulty and automatically selects the appropriate deliberation mode. Simple factual questions route to Quick mode (single round), while complex multi-stakeholder decisions route to Deep or Red Team modes. This optimizes cost without sacrificing quality where it matters. Consilium's template system (code_review, research_synthesis, risk_assessment, healthcare, legal, finance) extends this by pre-configuring the optimal debate depth for each domain.
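A sketch of what that routing could look like (the thresholds and the classifier feeding `complexity_score` are assumptions; the mode names are Consilium's):

```python
def route(complexity_score: float) -> str:
    """Map a question's estimated complexity in [0, 1] to a deliberation mode."""
    if complexity_score < 0.3:
        return "quick"      # simple factual question: single round
    if complexity_score < 0.7:
        return "council"    # multi-round deliberation with cross-examination
    return "deep"           # sub-agent research, or "red_team" for high stakes
```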
How Consilium Implements the Research
Every feature maps to a specific peer-reviewed finding.
| Research Finding | Paper | Consilium Features |
|---|---|---|
| Multi-agent debate improves factuality by 8-15% | Du et al. (ICML 2024) | Council mode, Deep mode, multi-round deliberation, cross-examination |
| Truth wins in structured debate even with asymmetric models | Khan et al. (ICML 2024 Best Paper) | Blind mode, identity-hidden judge evaluation, multiple argument orderings |
| Confidence-weighted consensus outperforms majority voting by 5-7% | Chen et al. (ACL 2024) | Condorcet voting, Borda count, confidence-weighted ballots, Ranked Pairs |
| Adversarial debate enables scalable oversight beyond judge capability | Irving et al. (Alignment Forum) | Red Team mode, typed attack/defend phases, mandatory dissent, judge synthesis |
| Multi-model discussion produces 23% more creative outputs | Li et al. (AAAI 2024) | Market mode, probability aggregation, role assignment, creative divergence |
| Complexity routing reduces debate cost by 60-80% on simple questions | Irving et al. (AI Safety) | Auto mode, complexity routing, template pre-configuration, cost optimization |
| Diverse model architectures outperform same-model ensembles by 3% | Chen et al. (ACL 2024) | 5 LLM providers, 15 models, cross-architecture debate |
| Mathematical convergence detection improves reliability | Du et al. (ICML 2024) | Kendall tau (0.4), Jaccard index (0.35), concession tracking (0.25); threshold 0.85 |
| Explanation stability predicts answer truthfulness | Khan et al. (ICML 2024 Best Paper) | Confidence calibration, concession-rate tracking, qualification penalty |
| Role assignment increases creative diversity by 12% | Li et al. (AAAI 2024) | Red Team roles (attacker/defender/judge), Blind dialectical structure, persona-driven deliberation |