Executive Summary
The LLM Consensus Engine simulates structured debates between multiple large language models (GPT-4o, Claude, Gemini) on a shared question. Models present arguments, engage in multiple rounds of rebuttals, then anonymously vote on the most compelling response based on reasoning quality, clarity, and relevance.
The goal isn't to find "the right answer"—it's to explore how consensus emerges (or fails to) across different model architectures, how disagreement patterns develop, and what happens when AI systems evaluate each other's reasoning.
What It Does
Structured Debate Simulation
Models like GPT-4o, Claude, and Gemini participate in multi-round debates with initial arguments and rebuttals.
Anonymous Voting
After debate rounds, models vote on the most compelling response, evaluating reasoning quality rather than model identity.
Transparent Rationale
Each vote includes explicit reasoning, enabling post-analysis of consensus patterns and disagreement drivers.
Debug & Analysis Tools
Debug panels and voting rationale views help understand how consensus forms (or breaks down) across different questions.
Why This Is Interesting
Consensus vs. Individual Responses
Individual AI models give individual answers. But when models debate and vote collectively, different patterns emerge. Consensus voting can surface reasoning that no single model would have produced alone, or reveal fundamental disagreements that individual responses mask.
Cross-Model Reasoning Evaluation
The engine doesn't just aggregate responses—it has models evaluate each other's reasoning. GPT-4o votes on Claude's arguments. Claude evaluates Gemini's logic. This creates a meta-reasoning layer: AI systems judging AI systems, revealing how different architectures evaluate quality, coherence, and persuasiveness.
Exploring Disagreement Patterns
When do models converge? When do they diverge? The engine tracks consensus formation across rounds, showing how initial disagreements evolve (or don't) through structured debate. This illuminates both the strengths and blind spots of different reasoning approaches.
How It Works
Debate Structure
- Initial Arguments: Each model presents an opening position on the shared question
- Rebuttal Rounds: Models respond to each other's arguments, refining positions
- Voting Phase: Models anonymously evaluate all arguments and vote for the most compelling
- Rationale Disclosure: Voting rationales are revealed for analysis
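The debate flow above can be sketched as a simple loop. This is an illustrative outline, not the engine's actual code: the `ask()` stub, prompt wording, and function names are assumptions standing in for real LLM API calls.

```python
def ask(model, prompt):
    # Stand-in for a real provider API call; returns a canned reply
    # so the sketch runs without API keys.
    return f"{model}: position on '{prompt[:40]}'"

def run_debate(question, models, rebuttal_rounds=2):
    """Run initial arguments plus N rebuttal rounds, returning the
    full transcript as (model, text) pairs."""
    transcript = []
    # 1. Initial arguments: each model states an opening position.
    for m in models:
        transcript.append((m, ask(m, question)))
    # 2. Rebuttal rounds: each model responds to the dialogue so far.
    for _ in range(rebuttal_rounds):
        history = "\n".join(text for _, text in transcript)
        for m in models:
            transcript.append((m, ask(m, f"Rebut the arguments so far:\n{history}")))
    return transcript

transcript = run_debate("Is consensus desirable?", ["gpt-4o", "claude", "gemini"])
```

The voting and rationale-disclosure phases would then consume this transcript.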
Voting Mechanism
Voting uses GPT-based evaluation of the full dialogue history. Models assess arguments on reasoning quality, clarity, relevance, and coherence—not model identity. Votes are anonymous to reduce bias toward specific architectures.
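One way to implement the anonymity described above is to strip model names and shuffle arguments behind neutral labels before voting, keeping the label-to-model mapping aside for the rationale-disclosure phase. A minimal sketch (the function names and label scheme are assumptions, not the engine's actual implementation):

```python
import random
from collections import Counter

def anonymize(arguments):
    """Map model-attributed arguments to neutral labels (A, B, C, ...)
    so votes are cast on content, not authorship."""
    entries = list(arguments.items())
    random.shuffle(entries)
    label_to_model = {}
    anon = {}
    for i, (model, text) in enumerate(entries):
        label = chr(ord("A") + i)
        label_to_model[label] = model  # kept aside for disclosure later
        anon[label] = text
    return anon, label_to_model

def tally(votes, label_to_model):
    """Count anonymous votes, then de-anonymize for rationale disclosure."""
    counts = Counter(votes)
    return {label_to_model[label]: n for label, n in counts.items()}
```

Shuffling before labeling also avoids positional bias, where evaluators might systematically favor the first argument shown.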
Technical Implementation
Built as a Streamlit application hosted on EC2, the engine orchestrates API calls to multiple LLM providers. It includes synthetic fallback logic for development/testing when real APIs are unavailable. The UI features collapsible sections for managing debate rounds, votes, and analysis.
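The synthetic fallback can be as simple as a dispatch wrapper that checks for credentials before calling a real provider. A hedged sketch, assuming an environment-variable check (the variable name `LLM_API_KEY` and the placeholder reply are illustrative, not the engine's actual configuration):

```python
import os

def call_llm(model, prompt):
    """Dispatch to a real provider API, or fall back to a synthetic
    response in development when no API key is configured.

    Sketch only: the env-var name and synthetic reply format are
    assumptions for illustration.
    """
    if not os.environ.get("LLM_API_KEY"):
        # Synthetic fallback: deterministic placeholder for dev/testing,
        # so the UI and debate flow can be exercised offline.
        return f"[synthetic:{model}] Simulated response to: {prompt[:40]}"
    # Real provider call (OpenAI / Anthropic / Google) would go here.
    raise NotImplementedError("real provider call goes here")
```

A deterministic fallback keeps UI development and debate-flow testing reproducible without burning API quota.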
Current Features
- Multi-model debate: GPT-4o, Claude (Anthropic), and Gemini participate in structured debates
- Model-specific personality prompts: Each model receives tailored prompts to encourage distinct reasoning styles
- Collapsible UI sections: Streamlit interface organizes rebuttals, votes, and rationales for exploration
- GPT-based voting: Uses full dialogue history for informed voting decisions
- Synthetic fallback: Development mode with simulated responses when APIs are unavailable
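The model-specific personality prompts listed above amount to prepending a distinct framing to the shared question per model. A minimal sketch; the prompt wording here is invented for illustration and may differ from the engine's actual prompts:

```python
# Illustrative personality framings (assumed wording, not the
# engine's actual prompts); each model gets a distinct style nudge.
PERSONALITY_PROMPTS = {
    "gpt-4o": "Argue analytically, prioritizing structured step-by-step logic.",
    "claude": "Argue carefully, surfacing caveats and ethical considerations.",
    "gemini": "Argue concisely, favoring concrete examples and comparisons.",
}

def build_prompt(model, question):
    """Prepend the model's personality framing to the shared question."""
    return f"{PERSONALITY_PROMPTS[model]}\n\nQuestion: {question}"
```

Keeping the framings in one dict makes it easy to audit how much of each model's "distinct reasoning style" comes from the prompt versus the architecture.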
Planned Features
- Leaderboards: Track model performance across debate topics and question types
- Historical debate archive: Searchable database of past debates and consensus patterns
- Per-round performance analytics: Understand how model performance varies by debate stage
- Human-vs-model voting mode: Compare human judgments with model consensus
What This Explores
Research questions:
- How does consensus differ from individual model outputs?
- What patterns emerge when AI systems evaluate each other's reasoning?
- Do certain model architectures consistently win debates, or does it vary by topic?
- How do disagreement patterns reveal blind spots in different reasoning approaches?
- Can collective AI reasoning produce insights that individual models miss?
Broader Context
This project is part of a broader exploration of AI self-awareness, ethical reasoning, and collective intelligence. It asks: What happens when we treat AI systems not as isolated tools, but as a panel of reasoning agents that can debate, disagree, and find consensus?
The engine doesn't assume consensus is always desirable—sometimes disagreement reveals more than agreement. It's a tool for understanding how reasoning, evaluation, and collective decision-making work across different AI architectures.
Status
This page is a public concept note—shared for discussion and posterity. The LLM Consensus Engine is an active experiment, hosted on EC2 and accessible via SSH port forwarding. The codebase is part of ongoing research into AI reasoning, consensus formation, and collective intelligence.
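Access via SSH port forwarding looks roughly like the following; the hostname, key path, and user are placeholders, and 8501 is Streamlit's default port:

```shell
# Forward Streamlit's default port (8501) from the EC2 host to
# localhost. Key path and hostname below are placeholders.
ssh -i ~/.ssh/your-key.pem -L 8501:localhost:8501 ubuntu@your-ec2-host
# Then open http://localhost:8501 in a local browser.
```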
Contributions welcome. This repo explores AI self-awareness, ethical reasoning, and collective intelligence. If you're into those things, you're in good company.