Executive Summary
The LLM Consensus Engine simulates structured debates between multiple large language models (GPT-4o, Claude, Gemini) on a shared question. Models present arguments, engage in multiple rounds of rebuttals, then anonymously vote on the most compelling response based on reasoning quality, clarity, and relevance.
The goal isn't to find "the right answer"—it's to explore how consensus emerges (or fails to) across different model architectures, how disagreement patterns develop, and what happens when AI systems evaluate each other's reasoning.
What It Does
Structured Debate Simulation
Models like GPT-4o, Claude, and Gemini participate in multi-round debates with initial arguments and rebuttals.
Anonymous Voting
After debate rounds, models vote on the most compelling response, evaluating reasoning quality rather than model identity.
Transparent Rationale
Each vote includes explicit reasoning, enabling post-analysis of consensus patterns and disagreement drivers.
Debug & Analysis Tools
Debug panels and voting rationale views help understand how consensus forms (or breaks down) across different questions.
Why This Is Interesting
Consensus vs. Individual Responses
Individual AI models give individual answers. But when models debate and vote collectively, different patterns emerge. Consensus voting can surface reasoning that no single model would have produced alone, or reveal fundamental disagreements that individual responses mask.
Cross-Model Reasoning Evaluation
The engine doesn't just aggregate responses—it has models evaluate each other's reasoning. GPT-4o votes on Claude's arguments. Claude evaluates Gemini's logic. This creates a meta-reasoning layer: AI systems judging AI systems, revealing how different architectures evaluate quality, coherence, and persuasiveness.
Exploring Disagreement Patterns
When do models converge? When do they diverge? The engine tracks consensus formation across rounds, showing how initial disagreements evolve (or don't) through structured debate. This illuminates both the strengths and blind spots of different reasoning approaches.
How It Works
Debate Structure
- Initial Arguments: Each model presents an opening position on the shared question
- Rebuttal Rounds: Models respond to each other's arguments, refining positions
- Voting Phase: Models anonymously evaluate all arguments and vote for the most compelling
- Rationale Disclosure: Voting rationales are revealed for analysis
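The debate flow above can be sketched as a simple loop. This is an illustrative outline, not the engine's actual code: the `ask()` stub, prompt wording, and function names are assumptions standing in for real LLM API calls.

```python
def ask(model, prompt):
    # Stand-in for a real provider API call; returns a canned reply
    # so the sketch runs without API keys.
    return f"{model}: position on '{prompt[:40]}'"

def run_debate(question, models, rebuttal_rounds=2):
    """Run initial arguments plus N rebuttal rounds, returning the
    full transcript as (model, text) pairs."""
    transcript = []
    # 1. Initial arguments: each model states an opening position.
    for m in models:
        transcript.append((m, ask(m, question)))
    # 2. Rebuttal rounds: each model responds to the dialogue so far.
    for _ in range(rebuttal_rounds):
        history = "\n".join(text for _, text in transcript)
        for m in models:
            transcript.append((m, ask(m, f"Rebut the arguments so far:\n{history}")))
    return transcript

transcript = run_debate("Is consensus desirable?", ["gpt-4o", "claude", "gemini"])
```

The voting and rationale-disclosure phases would then consume this transcript.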
Voting Mechanism
Voting uses GPT-based evaluation of the full dialogue history. Models assess arguments on reasoning quality, clarity, relevance, and coherence—not model identity. Votes are anonymous to reduce bias toward specific architectures.
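One way to implement the anonymity described above is to strip model names and shuffle arguments behind neutral labels before voting, keeping the label-to-model mapping aside for the rationale-disclosure phase. A minimal sketch (the function names and label scheme are assumptions, not the engine's actual implementation):

```python
import random
from collections import Counter

def anonymize(arguments):
    """Map model-attributed arguments to neutral labels (A, B, C, ...)
    so votes are cast on content, not authorship."""
    entries = list(arguments.items())
    random.shuffle(entries)
    label_to_model = {}
    anon = {}
    for i, (model, text) in enumerate(entries):
        label = chr(ord("A") + i)
        label_to_model[label] = model  # kept aside for disclosure later
        anon[label] = text
    return anon, label_to_model

def tally(votes, label_to_model):
    """Count anonymous votes, then de-anonymize for rationale disclosure."""
    counts = Counter(votes)
    return {label_to_model[label]: n for label, n in counts.items()}
```

Shuffling before labeling also avoids positional bias, where evaluators might systematically favor the first argument shown.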
Technical Implementation
Built as a Streamlit application hosted on EC2, the engine orchestrates API calls to multiple LLM providers. It includes synthetic fallback logic for development/testing when real APIs are unavailable. The UI features collapsible sections for managing debate rounds, votes, and analysis.
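The synthetic fallback can be as simple as a dispatch wrapper that checks for credentials before calling a real provider. A hedged sketch, assuming an environment-variable check (the variable name `LLM_API_KEY` and the placeholder reply are illustrative, not the engine's actual configuration):

```python
import os

def call_llm(model, prompt):
    """Dispatch to a real provider API, or fall back to a synthetic
    response in development when no API key is configured.

    Sketch only: the env-var name and synthetic reply format are
    assumptions for illustration.
    """
    if not os.environ.get("LLM_API_KEY"):
        # Synthetic fallback: deterministic placeholder for dev/testing,
        # so the UI and debate flow can be exercised offline.
        return f"[synthetic:{model}] Simulated response to: {prompt[:40]}"
    # Real provider call (OpenAI / Anthropic / Google) would go here.
    raise NotImplementedError("real provider call goes here")
```

A deterministic fallback keeps UI development and debate-flow testing reproducible without burning API quota.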
Current Features
- Multi-model debate: GPT-4o, Claude (Anthropic), and Gemini participate in structured debates
- Model-specific personality prompts: Each model receives tailored prompts to encourage distinct reasoning styles
- Collapsible UI sections: Streamlit interface organizes rebuttals, votes, and rationales for exploration
- GPT-based voting: Uses full dialogue history for informed voting decisions
- Synthetic fallback: Development mode with simulated responses when APIs are unavailable
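The model-specific personality prompts listed above amount to prepending a distinct framing to the shared question per model. A minimal sketch; the prompt wording here is invented for illustration and may differ from the engine's actual prompts:

```python
# Illustrative personality framings (assumed wording, not the
# engine's actual prompts); each model gets a distinct style nudge.
PERSONALITY_PROMPTS = {
    "gpt-4o": "Argue analytically, prioritizing structured step-by-step logic.",
    "claude": "Argue carefully, surfacing caveats and ethical considerations.",
    "gemini": "Argue concisely, favoring concrete examples and comparisons.",
}

def build_prompt(model, question):
    """Prepend the model's personality framing to the shared question."""
    return f"{PERSONALITY_PROMPTS[model]}\n\nQuestion: {question}"
```

Keeping the framings in one dict makes it easy to audit how much of each model's "distinct reasoning style" comes from the prompt versus the architecture.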
Planned Features
- Leaderboards: Track model performance across debate topics and question types
- Historical debate archive: Searchable database of past debates and consensus patterns
- Per-round performance analytics: Understand how model performance varies by debate stage
- Human-vs-model voting mode: Compare human judgments with model consensus
What This Explores
Research questions:
- How does consensus differ from individual model outputs?
- What patterns emerge when AI systems evaluate each other's reasoning?
- Do certain model architectures consistently win debates, or does it vary by topic?
- How do disagreement patterns reveal blind spots in different reasoning approaches?
- Can collective AI reasoning produce insights that individual models miss?
Broader Context
This project is part of a broader exploration of AI self-awareness, ethical reasoning, and collective intelligence. It asks: What happens when we treat AI systems not as isolated tools, but as a panel of reasoning agents that can debate, disagree, and find consensus?
The engine doesn't assume consensus is always desirable—sometimes disagreement reveals more than agreement. It's a tool for understanding how reasoning, evaluation, and collective decision-making work across different AI architectures.
Status
This page is a public concept note—shared for discussion and posterity. The LLM Consensus Engine is an active experiment, hosted on EC2 and accessible via SSH port forwarding. The codebase is part of ongoing research into AI reasoning, consensus formation, and collective intelligence.
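Access via SSH port forwarding looks roughly like the following; the hostname, key path, and user are placeholders, and 8501 is Streamlit's default port:

```shell
# Forward Streamlit's default port (8501) from the EC2 host to
# localhost. Key path and hostname below are placeholders.
ssh -i ~/.ssh/your-key.pem -L 8501:localhost:8501 ubuntu@your-ec2-host
# Then open http://localhost:8501 in a local browser.
```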
Contributions welcome. This repo explores AI self-awareness, ethical reasoning, and collective intelligence. If you're into those things, you're in good company.