Research References and Benchmarks
Consilium's deliberation approach is grounded in peer-reviewed research. Each mode maps to findings from published papers.
1. Improving Factuality and Reasoning in Language Models through Multiagent Debate
Authors: Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch
Venue: ICML 2024
Finding: Having multiple LLM agents debate one another improves factual accuracy by 8-15% and strengthens mathematical reasoning across benchmarks (GSM8K: 82% → 91%; MMLU: +8-12%).
Method: Multiple LLM instances propose answers, debate their reasoning, and revise based on peer feedback over multiple rounds.
Consilium implementation: Council and Deep modes implement this directly — multi-round debate with proposal, challenge, and rebuttal phases.
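The propose/challenge/rebuttal loop these modes share can be sketched as below. This is a minimal illustration, not Consilium's actual code: `ask_model` is a hypothetical stand-in for a real LLM call, and the prompt wording is invented.

```python
from typing import Callable, Dict, List

def debate(question: str,
           agents: List[str],
           ask_model: Callable[[str, str], str],
           rounds: int = 2) -> Dict[str, str]:
    """Multi-round debate: every agent proposes an answer, then repeatedly
    revises it after seeing and challenging each peer's latest answer."""
    # Proposal phase: each agent answers independently.
    answers = {a: ask_model(a, f"Question: {question}\nPropose an answer.")
               for a in agents}
    # Challenge/rebuttal rounds: agents see all peer answers and revise.
    for _ in range(rounds):
        peer_view = "\n".join(f"{a}: {ans}" for a, ans in answers.items())
        answers = {a: ask_model(
                       a,
                       f"Question: {question}\nPeer answers:\n{peer_view}\n"
                       "Challenge weak reasoning, then give your revised answer.")
                   for a in agents}
    return answers
```

With real models, the final answers are then merged by a judge or a vote; here any callable with the `(agent, prompt) -> str` shape can be plugged in for testing.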
2. Debating with More Persuasive LLMs Leads to More Truthful Answers
Authors: Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, et al.
Venue: ICML 2024 Best Paper
Finding: Even when one debater is more persuasive, structured debate protocols still converge on truthful answers; truth holds a natural advantage under such protocols.
Method: Asymmetric debate with varying model capabilities, evaluated by human judges.
Consilium implementation: Blind mode prevents model bias by stripping identity. Judge evaluates in multiple orderings to prevent anchoring.
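The two mechanisms named above, identity stripping and multi-ordering evaluation, can be sketched together as follows. This is an assumption-laden illustration: `score_ordering` is a hypothetical judge call that scores anonymous answers in the order they are presented.

```python
from itertools import permutations
from typing import Callable, Dict, List

def blind_judge(answers: Dict[str, str],
                score_ordering: Callable[[List[str]], List[float]]) -> Dict[str, float]:
    """Average each answer's score over every presentation order.

    Identities are stripped before judging (only the texts are shown),
    and scoring every permutation cancels position/anchoring bias."""
    names = list(answers)
    totals = {n: 0.0 for n in names}
    orderings = list(permutations(names))
    for order in orderings:
        texts = [answers[n] for n in order]   # judge never sees model names
        scores = score_ordering(texts)
        for n, s in zip(order, scores):
            totals[n] += s
    return {n: totals[n] / len(orderings) for n in names}
```

If the judge is purely position-biased (always rewarding whatever comes first), averaging over all orderings flattens the scores, which is exactly the anti-anchoring property claimed.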
3. ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
Authors: Justin Chen, Swarnadeep Saha, Mohit Bansal
Venue: ACL 2024
Finding: Diverse LLMs discussing and reaching consensus outperform any single model and simple ensembles by 3-10%.
Method: Round-table format with confidence-weighted voting across multiple rounds.
Consilium implementation: Council mode with Condorcet/Borda voting and confidence-weighted ballots.
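Of the two voting rules named above, the confidence-weighted Borda count is the simpler to sketch. The ballot format below (a best-first ranking plus a confidence in [0, 1]) is an illustrative assumption, not Consilium's wire format.

```python
from typing import Dict, List, Tuple

def borda(ballots: List[Tuple[List[str], float]]) -> Dict[str, float]:
    """Confidence-weighted Borda count.

    A ranking of k options awards k-1 points to 1st place, k-2 to 2nd,
    ..., 0 to last, each scaled by the voter's confidence."""
    scores: Dict[str, float] = {}
    for ranking, confidence in ballots:
        k = len(ranking)
        for pos, option in enumerate(ranking):
            scores[option] = scores.get(option, 0.0) + confidence * (k - 1 - pos)
    return scores
```

For example, ballots `(["A", "B", "C"], 1.0)` and `(["B", "A", "C"], 0.5)` give A 2.5 points and B 2.0, so the confident voter's first choice wins.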
4. AI Safety via Debate
Authors: Geoffrey Irving, Paul Christiano, Dario Amodei
Venue: arXiv preprint, 2018
Finding: Debate between AI systems can be used as an alignment technique, enabling humans to judge AI outputs on tasks they can't solve directly.
Method: Two AI systems debate while human judges evaluate.
Consilium implementation: Red Team mode (attack/defend/judge) and Jury mode (mandatory dissent reporting).
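The attack/defend/judge structure of Red Team mode reduces to a three-step protocol. A minimal sketch, assuming a hypothetical `ask(role, prompt)` LLM call and invented prompt wording:

```python
from typing import Callable

def red_team(claim: str, ask: Callable[[str, str], str]) -> str:
    """Run one attack/defend/judge pass over a claim."""
    # Attacker searches for the strongest objection.
    attack = ask("attacker", f"Find the strongest flaw in this claim: {claim}")
    # Defender rebuts the specific attack, not a strawman.
    defense = ask("defender", f"Claim: {claim}\nAttack: {attack}\nRebut the attack.")
    # Judge sees the full exchange and rules on the claim.
    return ask("judge",
               f"Claim: {claim}\nAttack: {attack}\nDefense: {defense}\n"
               "Uphold or reject the claim, giving reasons.")
```

The human-judge role from the paper maps here to the `judge` call; in Jury mode the judge would additionally be required to report dissenting views.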
5. LLM Discussion: Enhancing the Creativity of Large Language Models via Discussion Framework and Role-Play
Authors: Lu et al.
Venue: COLM 2024
Finding: Structured discussion between LLMs produces more creative and diverse outputs than individual generation.
Method: Models share perspectives, build on each other's ideas, then synthesize.
Consilium implementation: Market mode's probability aggregation encourages creative divergence before convergence via log-opinion pooling.
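For a binary event, log-opinion pooling is the normalized weighted geometric mean of the reported probabilities. A minimal sketch of that aggregation step (the function name and equal default weights are illustrative assumptions):

```python
import math
from typing import List, Optional

def log_pool(probs: List[float], weights: Optional[List[float]] = None) -> float:
    """Logarithmic opinion pool for one binary event.

    Pooled odds are the weighted geometric mean of each model's odds,
    renormalized so the result is again a probability."""
    if weights is None:
        weights = [1.0 / len(probs)] * len(probs)   # equal weights by default
    yes = math.prod(p ** w for p, w in zip(probs, weights))
    no = math.prod((1 - p) ** w for p, w in zip(probs, weights))
    return yes / (yes + no)
```

Unlike simple averaging, the log pool rewards agreement and punishes hedging: a confident 0.9 pooled with an uncertain 0.5 lands at 0.75, and unanimous inputs pass through unchanged.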
6. Scalable AI Safety via Doubly-Efficient Debate
Authors: Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras
Venue: arXiv preprint, 2023
Finding: Debate can be made computationally efficient while maintaining safety guarantees through complexity-aware routing.
Consilium implementation: Auto mode's complexity-based routing optimizes cost without sacrificing quality (simple → Quick, complex → Deep).
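Complexity-based routing amounts to scoring the question and dispatching to a mode. The sketch below uses an invented heuristic (word count plus cue words) purely to illustrate the shape of the router; Consilium's actual scoring rule is not specified here.

```python
def route(question: str) -> str:
    """Pick a deliberation mode from a rough complexity score.

    The cue-word list and thresholds are illustrative assumptions."""
    cues = ("prove", "design", "trade-off", "architecture", "why")
    score = len(question.split()) + 10 * sum(c in question.lower() for c in cues)
    if score < 15:
        return "quick"      # simple: single model, one pass
    if score < 40:
        return "council"    # moderate: multi-round debate
    return "deep"           # complex: full deliberation
```

This captures the cost/quality trade-off in the finding above: cheap single-pass answers for simple questions, full debate only when the question warrants it.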