Structured Disagreement Produces Better Decisions
Consilium implements formal argumentation protocols — proven in peer-reviewed research at ICML, ACL, and AAAI — where AI models propose, challenge, defend, and synthesize positions through adversarial debate.
Our Mission
We believe the best decisions emerge from structured disagreement. Consilium implements formal argumentation protocols — proven in peer-reviewed research — where AI models propose, challenge, defend, and synthesize through adversarial debate.
The result is consensus with tracked confidence, preserved dissent, and complete audit trails. Every conclusion is backed by evidence that survived adversarial scrutiny — not the output of a single model that was never challenged.
What Makes Consilium Different
Six technical differentiators that separate deliberation from orchestration.
Orchestration tools (CrewAI, AutoGen, LangGraph) run models in parallel and pick the best output. Consilium makes models argue, challenge claims, defend positions, vote, and only converge when mathematically confirmed. Cross-examination uses typed challenges (factual error, missing evidence, flawed logic) and categorized rebuttals (concede, refute, qualify, redirect). Each challenge must reference specific claims, and each rebuttal must provide evidence — not hand-waving.
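The typed-challenge contract described above can be sketched as a small Python model. This is an illustrative sketch, not Consilium's actual API: the class names, field names, and the evidence check are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum

class ChallengeType(Enum):
    FACTUAL_ERROR = "factual_error"
    MISSING_EVIDENCE = "missing_evidence"
    FLAWED_LOGIC = "flawed_logic"

class RebuttalType(Enum):
    CONCEDE = "concede"
    REFUTE = "refute"
    QUALIFY = "qualify"
    REDIRECT = "redirect"

@dataclass
class Challenge:
    type: ChallengeType
    target_claim_id: str  # every challenge must reference a specific claim
    rationale: str

@dataclass
class Rebuttal:
    type: RebuttalType
    challenge: Challenge
    evidence: str  # non-concession rebuttals must carry evidence

    def __post_init__(self):
        # "not hand-waving": reject evidence-free refutations up front
        if self.type is not RebuttalType.CONCEDE and not self.evidence:
            raise ValueError("non-concession rebuttals require evidence")
```

A typed contract like this lets the orchestrator validate debate moves structurally before any model output is accepted into the transcript.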
Challenge types: FACTUAL_ERROR, MISSING_EVIDENCE, FLAWED_LOGIC. Rebuttal types: CONCEDE, REFUTE, QUALIFY, REDIRECT.

Condorcet method finds the candidate that beats ALL others pairwise. Borda count provides confidence-weighted scoring across all positions. Ranked Pairs delivers cycle-free tiebreaking using a directed acyclic graph of pairwise victories. Copeland scoring enables comparative analysis by counting net pairwise wins. This is real social choice theory applied to AI consensus — not majority voting, not picking the most popular answer.
Algorithms: Condorcet, Borda Count, Ranked Pairs, Copeland.

Convergence is measured using Kendall tau correlation (0.4 weight) for ranking similarity, Jaccard index (0.35 weight) for proposal overlap, and concession tracking (0.25 weight) for position shifts. The composite score must reach 0.85 before consensus is declared. If convergence stalls, the system detects it and can trigger additional rounds or escalate to a different mode. Not vibes-based — mathematically verified.
Formula: 0.4 * kendall_tau + 0.35 * jaccard + 0.25 * concession_rate >= 0.85

Agglomerative clustering identifies minority positions across model responses by measuring semantic distance between position vectors. Every result includes both majority AND minority opinions. Healthcare, legal, and financial modes require explicit dissent reporting. No decision is declared unanimous unless mathematically verified through convergence metrics — and even then, the clustering algorithm surfaces the most distant position as a recorded dissent.
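A toy stand-in for the dissent-clustering step: naive single-linkage agglomerative clustering over position vectors, using cosine distance. The real engine presumably clusters embedding vectors; the 0.3 merge threshold and all names here are assumptions for illustration.

```python
import math

def cosine_dist(u: list[float], v: list[float]) -> float:
    """Cosine distance between two non-zero position vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return 1 - dot / (nu * nv)

def cluster_positions(vectors: list[list[float]], threshold: float = 0.3):
    """Single-linkage agglomerative clustering; returns (majority, minority clusters)."""
    clusters = [[i] for i in range(len(vectors))]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: merge if the closest pair crosses the threshold
                if min(cosine_dist(vectors[a], vectors[b])
                       for a in clusters[i] for b in clusters[j]) < threshold:
                    clusters[i] += clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    clusters.sort(key=len, reverse=True)
    return clusters[0], clusters[1:]  # majority position + all minority clusters
```

Each minority cluster survives into the result rather than being averaged away, which is the property the regulated-industry modes depend on.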
Clustering: agglomerative, distance-based. Output: majority position + all minority clusters.

Models that change their claims under cross-examination pressure receive lower confidence scores. Calibration formula: stability * (1 - concession_rate) * (1 - 0.3 * qualification_rate). This measures explanation stability — do models hold firm on well-supported positions, or cave under scrutiny? Models that maintain their position with evidence get higher calibration; models that flip without justification get penalized.
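The calibration formula translates one-to-one into code; the function name is the only thing added here.

```python
def calibration(stability: float, concession_rate: float,
                qualification_rate: float) -> float:
    """Confidence calibration from explanation stability.

    Full concessions zero out the score; qualifications only discount it
    by up to 30%, so hedging is penalized more gently than flipping.
    """
    return stability * (1 - concession_rate) * (1 - 0.3 * qualification_rate)
```

Note the asymmetry in the formula: a model that concedes everything scores 0 regardless of stability, while one that merely qualifies every claim keeps at least 70% of its stability-based score.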
Score = stability * (1 - concession_rate) * (1 - 0.3 * qualification_rate)

Every deliberation phase is recorded: input, output, tokens used, cost, and latency per model per round. Full transparency into how consensus was reached — which models agreed, who dissented, what challenges were raised, and how they were resolved. Token counts, cost breakdowns, and timing data enable cost optimization. Required for regulated industries like healthcare, finance, and legal.
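The per-phase audit fields map naturally onto a record type. This dataclass and the `total_cost` helper are illustrative, not the platform's actual schema; the field names follow the tracked fields listed in the document.

```python
from dataclasses import dataclass, asdict

@dataclass
class AuditRecord:
    """One model's activity in one phase of one round."""
    model: str
    round: int
    phase: str
    tokens_in: int
    tokens_out: int
    cost_usd: float
    latency_ms: int

def total_cost(records: list[AuditRecord]) -> float:
    """Aggregate spend across a deliberation's audit trail."""
    return sum(r.cost_usd for r in records)

rec = AuditRecord(model="claude-sonnet", round=1, phase="cross_examination",
                  tokens_in=1200, tokens_out=430, cost_usd=0.018, latency_ms=2100)
```

Because every record carries round and phase, the trail can be sliced to answer audit questions like "what did cross-examination cost?" or "which model was slowest in round 2?".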
Tracked per model: tokens_in, tokens_out, cost_usd, latency_ms, round, phase.

Our Story
Consilium started with a simple observation: when you ask one AI model a hard question, you get one perspective shaped by that model's training biases. Ask three models, and you get three perspectives — but no mechanism to resolve disagreements. We built that mechanism.
The breakthrough came from academic research on multi-agent debate. Papers from ICML 2024 showed that structured debate between LLMs improves factual accuracy by 8-15%, and that truth has a natural advantage in adversarial argumentation. We implemented these findings as a production platform with formal voting theory, convergence detection, and mandatory dissent preservation.
Consilium supports current-generation models across 7 providers: Anthropic (Claude Opus 4.7, Sonnet 4.6, Haiku 4.5), OpenAI (GPT-5.5 Pro, GPT-5.4), Google (Gemini 3.1 Pro, Gemini 3 Flash), xAI (Grok 4.20, Grok 4.1 Fast), Moonshot (Kimi K2.6), Groq for cost-effective inference (Llama 3.x, GPT-OSS, Compound), and OpenRouter for free-tier fallback. Models debate through a LangGraph state machine with typed challenges, categorized rebuttals, confidence-weighted voting, and mathematical convergence detection.
The architecture is a three-tier system: Next.js 15 frontend, NestJS 11 API with BullMQ job processing, and a FastAPI debate engine that orchestrates the deliberation state machine. Every phase is recorded for full auditability — which models agreed, who dissented, what evidence was cited, and how consensus was reached.
Architecture
Three-tier system with a LangGraph deliberation state machine.
Web (Next.js 15) → API (NestJS 11/Fastify) → Agents (FastAPI/Python)
↓
Debate Orchestrator
├── Round 1: Independent Analysis
├── Round 2: Cross-Examination
├── Round 3: Rebuttal & Refinement
└── Judge: 5-Phase Synthesis
Voting: Condorcet → Borda Count → Ranked Pairs → Copeland
Convergence: Kendall τ + Jaccard + Concession Tracking (threshold: 0.85)
Dissent: Agglomerative Clustering → Minority Position Preservation

Meet the Founder
Why one developer is building the multi-AI council for everyone else.
Hi, I'm Saad.
I build software for a living and got tired of the same pattern: ask one AI a hard question, get an answer that's almost right, lose two hours discovering the wrong half. The fix isn't a smarter single model — it's a room of models that argue, challenge each other, and only agree when they've really agreed. That's Consilium.
Make multi-AI deliberation the default for high-stakes engineering decisions. No more single-model guesses. No more provider lock-in. The council reads your code, debates the problem, and shows its work — so you can trust the answer or push back on it.
What I value
Why I built Consilium
Every existing AI coding tool is a single model with a pretty wrapper. Cursor uses Claude. Copilot uses GPT. Gemini Code uses Gemini. Each one has blind spots, and pretending otherwise is how you ship subtly broken code.
Consilium puts seven of them in the same room — OpenAI, Anthropic, Google, Groq, xAI, Moonshot, OpenRouter — and makes them argue with each other on your codebase. When they disagree, you see the disagreement. When they converge, you know it's real, not a single model's preference. That's the tool I wanted, so I built it.
Built for teams
Bring your own provider keys and pay only for what you use. BYOK by default, encrypted at rest, with a full SDK and CLI story.
