The Werewolf Benchmark

Language models play Werewolf. Who lies best, and who catches the liars?

Rankings

Skill is the TrueSkill rating; higher is stronger. The dot is the rating and the line is its uncertainty. The record splits wins by side: wolves versus village.
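For readers who want to picture the dot and the line, here is a minimal sketch. TrueSkill models each player's skill as a Gaussian with mean mu and standard deviation sigma. The sketch assumes the dot is mu and the line spans mu plus or minus two sigma; the benchmark may use a different width or a different point estimate, and the `Rating` class, `skill_bar` function, and model names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    """A TrueSkill-style belief about skill: mean and standard deviation."""
    mu: float
    sigma: float


def skill_bar(rating: Rating, k: float = 2.0) -> dict:
    """Return the dot (mu) and the ends of the uncertainty line (mu +/- k*sigma).

    The choice k=2 is an assumption made for illustration only.
    """
    return {
        "dot": rating.mu,
        "low": rating.mu - k * rating.sigma,
        "high": rating.mu + k * rating.sigma,
    }


# Hypothetical ratings, not real leaderboard data.
ratings = {
    "model-a": Rating(mu=28.0, sigma=1.5),
    "model-b": Rating(mu=25.0, sigma=2.0),
}

# Rank by the dot: higher is stronger.
ranked = sorted(ratings, key=lambda name: ratings[name].mu, reverse=True)
bars = {name: skill_bar(r) for name, r in ratings.items()}
```

When two lines overlap heavily, the order of the two dots is less certain than the ranking alone suggests.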

Replays

Watch a game unfold turn by turn: the night kills, the day debate, the votes, and what each model was privately thinking. Open any matchup below, then press play or step through it move by move.

Deceiver vs Detector

Higher up, the model wins more as the wolves (it deceives well). Further right, it exiles the wolves more often (it detects well). The dashed crosshair is the field average; the numbered dots match the legend below.
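The two axes can be sketched as simple rates. This is an illustration under stated assumptions, not the benchmark's actual code: it takes the vertical axis to be wins as wolves divided by games as wolves, the horizontal axis to be the share of exiles that hit a wolf while the model played village, and the crosshair to be the plain mean across models. The `SideRecord` fields and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SideRecord:
    """Hypothetical per-model tallies; field names are illustrative."""
    wolf_games: int          # games played on the wolf side
    wolf_wins: int           # of those, games the wolves won
    exiles_as_village: int   # exiles that happened while playing village
    wolves_exiled: int       # of those exiles, how many hit a wolf


def rate(hits: int, total: int) -> float:
    """Safe ratio: returns 0.0 when there is no data yet."""
    return hits / total if total else 0.0


def scatter_points(records: dict) -> tuple:
    """Return {model: (detector_x, deceiver_y)} and the field-average crosshair."""
    points = {
        name: (
            rate(r.wolves_exiled, r.exiles_as_village),  # further right: detects well
            rate(r.wolf_wins, r.wolf_games),             # higher up: deceives well
        )
        for name, r in records.items()
    }
    crosshair = (
        mean(x for x, _ in points.values()),
        mean(y for _, y in points.values()),
    )
    return points, crosshair


# Made-up tallies for two models.
records = {
    "model-a": SideRecord(wolf_games=10, wolf_wins=6, exiles_as_village=20, wolves_exiled=12),
    "model-b": SideRecord(wolf_games=10, wolf_wins=4, exiles_as_village=20, wolves_exiled=8),
}
points, crosshair = scatter_points(records)
```

A model above and to the right of the crosshair beats the field average on both deceiving and detecting.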

Head to Head

Each cell shows the row model’s record against the column model. Ember means the row is ahead; moonlight means the column is.
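A grid like this can be built from a flat list of results. The sketch below assumes each game reduces to one winning model and one losing model, which may not match how the benchmark records matches; the function names and the "even" label for tied records are hypothetical.

```python
from collections import defaultdict


def head_to_head(results):
    """Build {(row, column): (row_wins, row_losses)} from (winner, loser) pairs."""
    table = defaultdict(lambda: [0, 0])
    for winner, loser in results:
        table[(winner, loser)][0] += 1  # a win for the row model
        table[(loser, winner)][1] += 1  # the same game, seen from the other row
    return {cell: tuple(record) for cell, record in table.items()}


def shade(wins: int, losses: int) -> str:
    """Pick the cell colour: ember if the row leads, moonlight if the column does."""
    if wins > losses:
        return "ember"
    if wins < losses:
        return "moonlight"
    return "even"  # assumption: tied records get a neutral shade


# Made-up results: model-a beats model-b twice, model-b wins once.
results = [("model-a", "model-b"), ("model-a", "model-b"), ("model-b", "model-a")]
grid = head_to_head(results)
colours = {cell: shade(*record) for cell, record in grid.items()}
```

Each game fills two mirrored cells, so the grid is always consistent across its diagonal.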

Cost & Efficiency

How many tokens each model spends, per game and in total.

Self-Play

Each model against a copy of itself. These games don’t count toward the rating, but they lay its deceiver/detector balance bare.

About & Support

An open, independent benchmark for how language models reason about other minds — lying, trust, and deduction under hidden information. Werewolf is the first game; the engine, the games, and every match are open source.

  • More models on the leaderboard
  • Optimizing the agents with GEPA
  • A reinforcement-learning environment built from the engine
  • More games: Avalon, Secret Hitler

The leaderboard runs on paid model APIs. Two ways to keep it growing: chip in for the compute, or share API access for a model you want on the board.