# Why Deliberation Beats Orchestration
Most multi-agent frameworks — CrewAI, LangGraph orchestrations, AutoGen pipelines — treat AI models as workers in a sequence: researcher hands off to writer, writer hands off to editor, editor produces the final answer. Consilium does something different. Models are adversaries in a structured debate, not collaborators in an assembly line. This post explains why that choice changes the output and where it doesn't.
## Orchestration: errors propagate
In a sequential pipeline, the second model only sees the first model's output. If the first model hallucinated a fact, made an off-by-one in the reasoning, or anchored on the wrong framing, the downstream models inherit it. They might polish the prose, but they don't go back and check whether the upstream claim was correct — that's not their job in the pipeline graph. The compounding error problem is well-known in chained agent systems.
The orchestration approach also flattens disagreement. If two models would have produced contradictory answers, you only see whichever one happened to be in the right slot at the right time. The dissent — usually the most informative signal — is gone before the response leaves the system.
## Deliberation: errors get caught
Consilium runs a fixed protocol every debate:
- Round 1 — Independent analysis. Every model produces an answer to the topic in isolation. They never see each other's output in this round, so the responses are uncorrelated and any shared error has to come from training-data overlap, not from one model influencing another.
- Round 2 — Cross-examination. Each model receives every other model's Round 1 answer and is asked to challenge it on factual errors, flawed reasoning, missing evidence, and edge cases. Challenges are typed, so the engine can route each one to the right defender.
- Round 3 — Rebuttal and refinement. Defenders respond to each challenge (concede, refute, qualify, or redirect) and produce a revised answer that incorporates the survivable points and drops the indefensible ones.
- Judge — 5-phase synthesis. A separate judge model performs claim extraction, cross-reference (which claims survived challenge), dispute resolution (where models still disagree), rubric scoring (correctness 30% / reasoning 25% / completeness 20% / actionability 15% / conciseness 10%), and produces the final verdict with a dissent report attached.
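The judge's rubric from the list above is just a weighted sum over five dimensions. The weights come from the protocol; the function name and the per-dimension scores below are illustrative, not Consilium's actual API:

```python
# Rubric weights from the judge's synthesis phase described above.
RUBRIC = {
    "correctness": 0.30,
    "reasoning": 0.25,
    "completeness": 0.20,
    "actionability": 0.15,
    "conciseness": 0.10,
}

def rubric_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each in [0, 1]."""
    missing = RUBRIC.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing rubric dimensions: {missing}")
    return sum(RUBRIC[dim] * scores[dim] for dim in RUBRIC)

# Example: a revised answer that is factually solid but verbose.
print(round(rubric_score({
    "correctness": 0.9,
    "reasoning": 0.8,
    "completeness": 0.7,
    "actionability": 0.6,
    "conciseness": 0.4,
}), 2))  # → 0.74
```

Because correctness and reasoning carry more than half the weight, a concise but wrong answer can't out-score a verbose but right one.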
The cross-examination round is the load-bearing piece. It is the only place in the protocol where one model's mistake gets named by another model in the same conversation. In orchestration, that doesn't happen — there's no round where the editor is asked "does this claim from the researcher actually hold up?"
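The typed-challenge routing described above can be sketched with a couple of dataclasses. The field names and challenge-type strings here are assumptions for illustration, not Consilium's actual schema:

```python
from dataclasses import dataclass
from collections import defaultdict

# Round 2 challenge types (assumed labels for illustration).
CHALLENGE_TYPES = {"factual_error", "flawed_reasoning", "missing_evidence", "edge_case"}

@dataclass
class Challenge:
    challenger: str   # model that raised the challenge
    defender: str     # model whose Round 1 answer is being challenged
    kind: str         # one of CHALLENGE_TYPES
    claim: str        # the specific claim under attack

def route_challenges(challenges: list[Challenge]) -> dict[str, list[Challenge]]:
    """Group each typed challenge under the model that must answer it in Round 3."""
    inbox: dict[str, list[Challenge]] = defaultdict(list)
    for ch in challenges:
        if ch.kind not in CHALLENGE_TYPES:
            raise ValueError(f"unknown challenge type: {ch.kind}")
        inbox[ch.defender].append(ch)
    return dict(inbox)

challenges = [
    Challenge("gpt", "claude", "factual_error", "Release year is wrong"),
    Challenge("gemini", "claude", "edge_case", "Ignores empty-input case"),
    Challenge("claude", "gpt", "flawed_reasoning", "Step 3 assumes its conclusion"),
]
inbox = route_challenges(challenges)
print({k: len(v) for k, v in inbox.items()})  # → {'claude': 2, 'gpt': 1}
```

The key property is that every challenge names a defender, so nothing gets silently dropped between rounds: each Round 3 rebuttal is answerable against a specific Round 2 objection.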
## A concrete diff
| Property | Orchestration (CrewAI / LangChain agents) | Deliberation (Consilium) |
|---|---|---|
| Model interaction | Sequential pipeline | Adversarial rounds |
| Error handling | Propagates downstream | Caught by cross-examination |
| Confidence | Self-reported | Calibrated via convergence detection |
| Disagreement | Hidden / overwritten | Surfaced as dissent reports |
| Audit trail | Logs of intermediate outputs | Structured claims, challenges, rebuttals |
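The "calibrated via convergence detection" row deserves a gloss. One simple way to derive a calibrated confidence, sketched here as an assumption rather than Consilium's actual algorithm, is to measure how many panelists' revised answers agree after Round 3:

```python
from collections import Counter

def convergence_confidence(final_positions: list[str]) -> tuple[str, float]:
    """Return the majority position and the fraction of models that converged on it.

    Agreement across independently revised answers is harder to game than a
    single model's self-reported confidence score.
    """
    if not final_positions:
        raise ValueError("no positions to compare")
    position, count = Counter(final_positions).most_common(1)[0]
    return position, count / len(final_positions)

# Three of five panelists converged on answer A after rebuttals.
print(convergence_confidence(["A", "A", "B", "A", "C"]))  # → ('A', 0.6)
```

The two dissenting positions aren't discarded; they're exactly what the dissent report in the judge's verdict surfaces.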
## When orchestration is better
Deliberation is not a universal upgrade. We don't ship Consilium as a replacement for every agent framework, and you'd misuse it if you tried. Orchestration wins for:
- Pure tool execution. If the task is "run this query, summarize the result" and the tool call is the bottleneck, multiple models arguing about the result is overkill.
- Speed-bound interactive UX. A 3-round debate adds 30–60 seconds of latency. For an autocomplete or a typing-speed chat surface, that's not the right tradeoff.
- Single-domain expert pipelines. When you genuinely have a researcher → writer → editor flow and the boundaries between roles are clear, orchestration is a more natural fit than "all three argue."
## When deliberation is better
- Hard reasoning with disagreement. Architecture decisions, technical tradeoffs, code reviews where multiple defensible answers exist.
- Hallucination-prone domains. Anything where one model could confidently produce a wrong fact and the only way to catch it is another model checking. Du et al. and Khan et al. both quantify this gain in the literature.
- High-stakes decisions where dissent matters. When the user wants to see "the answer is X, but two out of five panelists pushed back on Y", deliberation surfaces that. Orchestration silently picks one.
## What this looks like in code
The deliberation graph lives in `apps/agents/src/features/deliberation/deliberation_graph.py` and the judge in `apps/agents/src/core/judge.py`. It's a LangGraph state machine: round transitions are explicit nodes, the Round 2 challenge generation is its own prompt, and the rebuttal classifications (concede / refute / qualify / redirect) come back as typed structured output. The whole thing is auditable: every challenge, every rebuttal, and every claim that survived to the synthesis is preserved in the debate session record.
That auditability is the other reason we picked deliberation over orchestration. When a debate produces a controversial answer, you can replay it and see exactly which model said what at which point — not just "the writer agent produced this."