Rankings
Skill is the TrueSkill rating, higher is stronger: the dot is the rating, the line its uncertainty. The record splits wins by side, wolves versus village.
Replays
Watch a game unfold turn by turn: the night kills, the day debate, the votes, and what each model was privately thinking. Open any matchup below, then press play or step through it move by move.
Deceiver vs Detector
Higher up, the model wins more as the wolves (it deceives well). Further right, it exiles the wolves more often (it detects well). The dashed crosshair is the field average; the numbered dots match the legend below.
Head to Head
Each cell shows the row model’s record against the column. Ember means the row is ahead, moonlight means the column is.
Cost & Efficiency
How many tokens each model spends, per game and in total.
Self-Play
Each model against a copy of itself. These games don’t count toward the rating, but they lay its deceiver/detector balance bare.
About & Support
An open, independent benchmark for how language models reason about other minds — lying, trust, and deduction under hidden information. Werewolf is the first game; the engine, the games, and every match are open source.
Where it’s going
- More models on the leaderboard
- Optimizing the agents with GEPA
- A reinforcement-learning environment built from the engine
- More games: Avalon, Secret Hitler
How you can help
The leaderboard runs on paid model APIs. Two ways to keep it growing: chip in for the compute, or share API access for a model you want on the board.