How the Deliberation Engine Works
A deep technical explanation of the state machine, voting algorithms, convergence detection, dissent clustering, and confidence calibration that power Consilium.
A. The State Machine
Consilium's deliberation engine is built on a LangGraph-based state machine. Each deliberation progresses through a defined sequence of phases, with the state object accumulating results at each step. The state machine enforces the debate protocol: no model can skip a phase, and convergence is checked mathematically before termination.
Phase handlers execute sequentially. After CONVERGENCE, the engine either loops back to PROPOSAL for another round or proceeds to OUTPUT; the round number increments after each convergence check.
| Field | Type | Description |
|---|---|---|
| topic | string | The question or topic being deliberated |
| mode | DeliberationMode | Active deliberation mode (quick/council/deep/blind/redteam/jury/market/auto) |
| round_number | int | Current round (increments after convergence check) |
| max_rounds | int | Maximum rounds before forced output |
| models | list[str] | Model IDs participating in the debate |
| judge_model | str | Model used for evaluation and synthesis |
| proposals | list[dict] | Independent positions from each model |
| challenges | list[dict] | Cross-examination results with typed objections |
| rebuttals | list[dict] | Responses: CONCEDE, REFUTE, QUALIFY, or REDIRECT |
| evaluations | list[dict] | Rubric-based scoring of each proposal |
| votes | list[dict] | Ranked ballots with confidence weights |
| aggregation_result | dict | Combined vote results (winner, method, ranking) |
| convergence_result | dict | Convergence score and recommendation |
| dissent_report | dict | Majority/minority positions via clustering |
| confidence_scores | dict | Per-model calibrated confidence |
| audit_trail | list[dict] | Every step: model, input, output, tokens, cost, latency |
| cost_tracker | dict | Cost breakdown by model and round |
| golden_prompt | str | Final synthesized answer |
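The table above can be summarized as a Python `TypedDict`. This is a sketch only: the field names follow the table, but the concrete types (and the use of plain `str` for `mode`) are assumptions.

```python
from typing import TypedDict

class DeliberationState(TypedDict, total=False):
    """State accumulated across phases; field names follow the table, types are assumptions."""
    topic: str
    mode: str                    # DeliberationMode enum value in the real engine
    round_number: int
    max_rounds: int
    models: list[str]
    judge_model: str
    proposals: list[dict]
    challenges: list[dict]
    rebuttals: list[dict]
    evaluations: list[dict]
    votes: list[dict]
    aggregation_result: dict
    convergence_result: dict
    dissent_report: dict
    confidence_scores: dict
    audit_trail: list[dict]
    cost_tracker: dict
    golden_prompt: str

state: DeliberationState = {"topic": "Should we adopt Rust?", "round_number": 1, "max_rounds": 3}
```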
PROPOSAL — Each model independently generates: claims (list of assertions), reasoning chain (step-by-step logic), confidence score, and supporting evidence. No model sees others' proposals.
CHALLENGE — Models cross-examine each other. Challenges are typed: factual errors, missing evidence, logical flaws, better alternatives. Each challenge targets a specific claim in another model's proposal.
REBUTTAL — Defenders respond with categorized rebuttals: CONCEDE (accept the challenge), REFUTE (counter with evidence), QUALIFY (accept partially with conditions), or REDIRECT (reframe the question). Rebuttal types feed into convergence and confidence metrics.
EVALUATION — Proposals scored against a rubric with weighted dimensions. Each dimension gets a 0-1 score. The rubric varies by template (e.g., security 30% + correctness 25% for code review).
VOTING — Models cast RankedBallots: an ordered preference list of proposals with a confidence_weight (0-1). Higher confidence = more influence on the final ranking.
AGGREGATION — Votes aggregated through the voting pipeline: Borda scores → full ranking → Condorcet check → Ranked Pairs fallback. Produces winner, method used, and confidence level.
CONVERGENCE — Three metrics combined to determine if debate should continue. If converged (score ≥ 0.85) or max rounds reached, proceeds to OUTPUT. Otherwise loops back to PROPOSAL.
OUTPUT — Final synthesis: judge model integrates strongest arguments, applies dissent detection, calibrates confidence, and produces the golden prompt.
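The phase sequence and the convergence loop can be sketched as plain control flow. Everything here is a stand-in: the handlers only record the phase order, and the convergence stub just checks the round limit rather than computing the real metrics.

```python
def make_phase(name):
    def handler(state):
        state.setdefault("trace", []).append(name)   # a real handler would call models here
        return state
    return handler

PHASES = [make_phase(n) for n in
          ("PROPOSAL", "CHALLENGE", "REBUTTAL", "EVALUATION", "VOTING", "AGGREGATION")]

def check_convergence(state):
    # Stub: a real check combines the three convergence metrics against the 0.85 threshold.
    state["converged"] = state["round_number"] >= state["max_rounds"]
    return state

def run(state):
    while True:
        for phase in PHASES:
            state = phase(state)
        state = check_convergence(state)
        if state["converged"]:
            break
        state["round_number"] += 1                    # increments after the convergence check
    state["trace"].append("OUTPUT")                   # final synthesis phase
    return state

final = run({"round_number": 1, "max_rounds": 2})     # two full rounds, then OUTPUT
```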
B. Voting Mechanisms
Consilium implements four formal social choice theory algorithms. These aren't simple "pick the most popular" mechanisms — they're mathematically rigorous voting methods used in political science and decision theory.
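Before the individual methods, here is a minimal confidence-weighted sketch of the pairwise (Condorcet) check, the Borda count, and Copeland scoring. The ballot shape (an ordered `ranking` list plus a `confidence_weight`) is an assumption consistent with the VOTING phase above; the Ranked Pairs fallback is omitted for brevity.

```python
from itertools import combinations

# Assumed ballot shape: {"ranking": [candidate, ...] best-first, "confidence_weight": float}
def pairwise_pref(ballots, a, b):
    """Confidence-weighted support for ranking a above b."""
    return sum(bt["confidence_weight"] for bt in ballots
               if bt["ranking"].index(a) < bt["ranking"].index(b))

def condorcet_winner(ballots, candidates):
    """Candidate that wins every pairwise matchup, or None."""
    for c in candidates:
        if all(pairwise_pref(ballots, c, o) > pairwise_pref(ballots, o, c)
               for o in candidates if o != c):
            return c
    return None

def borda_ranking(ballots, candidates):
    """Confidence-weighted Borda count; returns candidates best-first."""
    n = len(candidates)
    points = {c: 0.0 for c in candidates}
    for bt in ballots:
        for r, c in enumerate(bt["ranking"]):
            points[c] += (n - 1 - r) * bt["confidence_weight"]
    return sorted(candidates, key=lambda c: points[c], reverse=True)

def copeland_scores(ballots, candidates):
    """Pairwise wins minus pairwise losses per candidate."""
    score = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        pa, pb = pairwise_pref(ballots, a, b), pairwise_pref(ballots, b, a)
        if pa != pb:
            score[a if pa > pb else b] += 1
            score[b if pa > pb else a] -= 1
    return score

ballots = [
    {"ranking": ["A", "B", "C"], "confidence_weight": 0.9},
    {"ranking": ["B", "A", "C"], "confidence_weight": 0.6},
    {"ranking": ["A", "C", "B"], "confidence_weight": 0.8},
]
candidates = ["A", "B", "C"]
condorcet_winner(ballots, candidates)   # → "A"
borda_ranking(ballots, candidates)      # → ["A", "B", "C"]
copeland_scores(ballots, candidates)    # → {"A": 2, "B": 0, "C": -2}
```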
Condorcet Winner — Checks whether any candidate beats ALL others in pairwise matchups. For each pair of candidates (A, B), it counts how many voters prefer A over B (weighted by confidence_weight). If one candidate wins every pairwise comparison, it is the Condorcet winner — the strongest possible consensus.
For each pair (A, B):
score_A = sum(confidence_weight for ballots where A ranked above B)
score_B = sum(confidence_weight for ballots where B ranked above A)
A wins pair if score_A > score_B
Condorcet winner = candidate that wins ALL pairwise comparisons
Returns: single winner or None (triggers Ranked Pairs fallback)

Borda Count — Assigns points based on rank position, weighted by voter confidence. Produces a complete ranking of all candidates, not just a winner.
For each ballot:
For each candidate at rank r (0-indexed):
points[candidate] += (n - 1 - r) * confidence_weight
Full ranking = candidates sorted by total points (descending)
Used even when a Condorcet winner exists, to produce the complete ordering.

Ranked Pairs — When no Condorcet winner exists (a cycle: A beats B, B beats C, C beats A), Ranked Pairs resolves the cycle by locking in the strongest victories first while preventing new cycles.
1. List all pairwise matchups with victory margins
2. Sort by margin (descending) — strongest victories first
3. For each matchup:
- Lock the edge (winner → loser) IF it doesn't create a cycle
- Skip if it would create a cycle (topological sort check)
4. Winner = candidate with no incoming locked edges
Complexity: O(n² log n) where n = number of candidates

Copeland Score — Simple win/loss scoring for comparative analysis. It is not used for final winner selection, but it provides an intuitive "how dominant is this candidate?" metric.
For each candidate:
copeland_score = (# pairwise wins) - (# pairwise losses)
Range: -(n-1) to +(n-1)
Example with 4 candidates:
A beats B, C, D → score = +3 (dominant)
B beats C, loses to A, D → score = -1

The full aggregation pipeline:
1. Calculate Borda scores (confidence-weighted)
2. Generate full_ranking from Borda
3. Check for Condorcet winner
→ Found: return (winner, full_ranking, method="condorcet", confident=True)
→ Not found: use Ranked Pairs as tiebreaker
4. Return (ranked_pairs_winner, full_ranking, method="ranked_pairs", confident=False)

C. Convergence Detection
Convergence detection determines whether the debate has reached a stable consensus or should continue for another round. Three independent metrics are combined into a single score.
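The three metrics and the termination rules detailed below can be sketched as follows. Function names and argument shapes are assumptions; only the weights, normalization, and thresholds come from this section.

```python
from itertools import combinations

def ranking_similarity(prev, curr):
    """Normalized Kendall tau between two rankings of the same items -> [0, 1]."""
    pos_prev = {item: i for i, item in enumerate(prev)}
    pos_curr = {item: i for i, item in enumerate(curr)}
    concordant = discordant = 0
    for a, b in combinations(prev, 2):
        # Same relative order in both rounds -> concordant, else discordant.
        if (pos_prev[a] - pos_prev[b]) * (pos_curr[a] - pos_curr[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    tau = (concordant - discordant) / total if total else 1.0
    return tau * 0.5 + 0.5                    # map [-1, 1] to [0, 1]

def proposal_similarity(prev_text, curr_text):
    """Jaccard overlap of proposal word sets across rounds."""
    a, b = set(prev_text.lower().split()), set(curr_text.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def convergence_score(rank_sim, prop_sim, concession_rate):
    return 0.40 * rank_sim + 0.35 * prop_sim + 0.25 * concession_rate

def is_converged(round_number, max_rounds, score, threshold=0.85):
    if round_number >= max_rounds:
        return True                            # forced termination
    if round_number < 2:
        return False                           # need a baseline round
    return score >= threshold
```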
Kendall Tau Ranking Similarity — Measures how similar the vote rankings are between consecutive rounds. It maps items to positions and counts concordant vs. discordant pairs. Normalized to [0, 1], where 1.0 = identical rankings across rounds.
tau = (concordant_pairs - discordant_pairs) / total_pairs
normalized = tau * 0.5 + 0.5 → maps [-1, 1] to [0, 1]
concordant: pair (i,j) ranked same order in both rounds
discordant: pair (i,j) ranked in opposite order

Jaccard Proposal Similarity — Measures how much the actual content of proposals overlaps between rounds. Converts proposals to word sets and computes intersection over union.
For each model's proposals across rounds:
words_prev = set(proposal_round_n.lower().split())
words_curr = set(proposal_round_n+1.lower().split())
similarity = |words_prev ∩ words_curr| / |words_prev ∪ words_curr|
Average across all model-pair comparisons

Concession Rate — Fraction of rebuttals where models concede or qualify their positions. A high concession rate means models are willing to adapt, indicating movement toward consensus.

concession_rate = count(rebuttals where type == CONCEDE or QUALIFY) / total_rebuttals

Combined score:
score = 0.40 * ranking_similarity (Kendall tau)
+ 0.35 * proposal_similarity (Jaccard)
+ 0.25 * concession_rate (rebuttal analysis)
Termination rules:
round >= max_rounds → converged = True (forced)
round < 2 → converged = False (need baseline)
score >= 0.85 → converged = True (consensus)
score < 0.85 → converged = False (continue)
Output: { converged, score, components, recommendation }

D. Dissent Detection
Dissent detection identifies whether models genuinely agree or if there are distinct camps with fundamentally different positions. Uses agglomerative clustering on proposal content similarity.
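The clustering procedure below can be sketched in Python. The `{model_id: proposal_text}` input shape is an assumption; the Jaccard measure, the 0.5 merge threshold, and the largest-cluster-is-majority rule come from this section.

```python
from itertools import combinations

def jaccard(text_a, text_b):
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_positions(proposals, threshold=0.5):
    """Agglomerative clustering on proposal text.
    `proposals` is assumed to be {model_id: proposal_text}."""
    clusters = [[m] for m in proposals]                  # singleton clusters
    def avg_sim(c1, c2):
        sims = [jaccard(proposals[a], proposals[b]) for a in c1 for b in c2]
        return sum(sims) / len(sims)
    while len(clusters) > 1:
        # Find the closest pair of clusters by average pairwise similarity.
        (i, j), best = max(
            (((i, j), avg_sim(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1])
        if best < threshold:
            break                                        # remaining clusters are distinct positions
        clusters[i] += clusters.pop(j)                   # merge (i < j, so pop(j) is safe)
    return sorted(clusters, key=len, reverse=True)       # largest cluster = majority

proposals = {
    "m1": "use rust for safety",
    "m2": "use rust for safety and speed",
    "m3": "stay with python entirely",
}
groups = cluster_positions(proposals)   # → [["m1", "m2"], ["m3"]]
```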
1. Build similarity matrix:
matrix[i][j] = Jaccard(words_i, words_j)
Symmetric: matrix[i][j] == matrix[j][i]
Diagonal: matrix[i][i] = 1.0
2. Initialize: each proposal = singleton cluster
3. Iteratively merge:
- Find closest cluster pair (highest avg pairwise similarity)
- If similarity >= 0.5 threshold: merge into one cluster
- If similarity < 0.5: stop (remaining clusters are distinct positions)
4. Interpret results:
- 1 cluster → consensus (majority only, no dissent)
- 2+ clusters → dissent detected
- Largest cluster = majority position
- Others = minority positions

The dissent report has this shape:
{
type: "consensus" | "dissent",
majority: {
models: ["claude-sonnet-4-6", "gpt-5.4"],
position_summary: "First 200 chars of largest cluster's proposal",
key_arguments: ["extracted from claims"],
proposals: [full proposal objects]
},
minority: [ // empty if consensus
{
models: ["gemini-3-flash-preview"],
position_summary: "...",
key_arguments: ["..."],
proposals: [...]
}
],
disagreement_points: [
{ challenger: "gemini", target: "claude", type: "REFUTE", argument: "..." }
]
}

E. Confidence Calibration
Confidence calibration measures how much each model actually stands behind its claims. A model that caves under scrutiny gets a lower confidence score than one that defends its position with evidence. This is based on "explanation stability" — the degree to which a model's claims survive cross-examination.
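The formulas below combine into one small function. This is a sketch: claim lists and rebuttal dicts with a `"type"` key are assumed shapes, and the single-revision Jaccard here stands in for the averaged stability score.

```python
def calibrate_confidence(original_claims, post_challenge_claims, rebuttals):
    """Sketch of explanation-stability confidence calibration."""
    a, b = set(original_claims), set(post_challenge_claims)
    stability = len(a & b) / len(a | b) if a | b else 1.0   # Jaccard of claim sets
    total = len(rebuttals) or 1
    concession_rate = sum(r["type"] == "CONCEDE" for r in rebuttals) / total
    qualification_rate = sum(r["type"] == "QUALIFY" for r in rebuttals) / total
    value = stability * (1 - concession_rate) * (1 - 0.3 * qualification_rate)
    return {
        "value": max(0.0, min(1.0, value)),
        "stability_score": stability,
        "concession_rate": concession_rate,
        "method": "explanation_stability",
    }

result = calibrate_confidence(
    ["a", "b", "c", "d"], ["a", "b", "c", "d"],
    [{"type": "REFUTE"}, {"type": "QUALIFY"}],
)   # claims unchanged, no concessions, one of two rebuttals qualified → value 0.85
```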
stability_score = avg(Jaccard(original_claims, post_challenge_claims))
→ 1.0 = claims unchanged, 0.0 = completely revised
concession_rate = count(CONCEDE rebuttals) / total_rebuttals
→ Higher = model yielded more often
qualification_rate = count(QUALIFY rebuttals) / total_rebuttals
→ Partial yielding, less severe than concession
calibrated_confidence = stability_score
* (1 - concession_rate)
* (1 - 0.3 * qualification_rate)
Clamped to [0.0, 1.0]
Output: {
value: float, // final calibrated score
stability_score: float,
concession_rate: float,
method: "explanation_stability"
}

F. Cost-Based Routing
Auto mode uses cost-based routing to select the optimal deliberation mode and model count. It extracts features from the query, scores complexity, and routes to the cheapest configuration that meets quality requirements.
Feature extraction:
token_count: number of words in topic
has_code: presence of code markers (```, def, class, import, {})
is_factual: starts with "what is", "who is", "when did", "how many"
is_creative: contains "write", "create", "design", "brainstorm", "imagine"
is_analytical: contains "compare", "analyze", "evaluate", "pros and cons"
has_stakes: contains "medical", "legal", "financial", "security",
"compliance", "hipaa", "soc"

Base score (from token count):
< 20 tokens: 0.1
≤ 100 tokens: 0.3
≤ 500 tokens: 0.5
> 500 tokens: 0.7
Adjustments:
+ 0.2 if has_code
+ 0.3 if has_stakes_keywords
+ 0.2 if is_analytical
+ 0.1 if is_creative
- 0.2 if is_factual
Floor: if has_stakes and score < 0.3, boost to 0.3
Routing decision:
score < 0.3 → Quick mode, 1 model (cheapest)
score < 0.6 → Council mode, 3 models (balanced)
score ≥ 0.6 → Council mode, 3-5 models (thorough)
score ≥ 0.8 → Deep mode, 5 models (maximum)
Cost estimation:
estimated = num_api_calls * estimated_tokens * cost_per_token
Quick: 1 call
Council: num_models * 3 rounds
Deep: num_models * 5 rounds
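Putting the feature extraction, scoring, and routing rules above together, a minimal sketch (keyword matching via simple substring tests is an assumption, and the "3-5 models" band is pinned to 5 here):

```python
def extract_features(topic: str) -> dict:
    t = topic.lower()
    return {
        "token_count": len(topic.split()),
        "has_code": any(m in topic for m in ("```", "def ", "class ", "import ", "{")),
        "is_factual": t.startswith(("what is", "who is", "when did", "how many")),
        "is_creative": any(w in t for w in ("write", "create", "design", "brainstorm", "imagine")),
        "is_analytical": any(w in t for w in ("compare", "analyze", "evaluate", "pros and cons")),
        "has_stakes": any(w in t for w in ("medical", "legal", "financial", "security",
                                           "compliance", "hipaa", "soc")),
    }

def complexity_score(f: dict) -> float:
    n = f["token_count"]
    score = 0.1 if n < 20 else 0.3 if n <= 100 else 0.5 if n <= 500 else 0.7
    score += 0.2 * f["has_code"] + 0.3 * f["has_stakes"]
    score += 0.2 * f["is_analytical"] + 0.1 * f["is_creative"]
    score -= 0.2 * f["is_factual"]
    if f["has_stakes"] and score < 0.3:
        score = 0.3                                   # stakes floor
    return score

def route(score: float) -> tuple[str, int]:
    if score >= 0.8:
        return ("deep", 5)
    if score >= 0.6:
        return ("council", 5)                         # "3-5 models"; 5 assumed here
    if score >= 0.3:
        return ("council", 3)
    return ("quick", 1)

mode, num_models = route(complexity_score(extract_features(
    "What is the capital of France?")))              # → ("quick", 1)
```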