SpireBench

A reproducible LLM benchmark on Slay the Spire 2 played through the HermesBridge mod. Each run is one full game from Neow to death or victory, captured as a line-for-line bridge command log plus a structured stat record. This site publishes the trial-v0 results.

  • Runs published: 25
  • Reached Act 3: 2
  • Victories: 0
  • Models tested: 5
Models
glm-5.1 ×5, gemini-3.1-pro-preview ×5, gpt-5.5 ×5, deepseek-v4-pro ×5, claude-opus-4.7 ×5
Characters
IRONCLAD ×5, REGENT ×5, SILENT ×5, DEFECT ×5, NECROBINDER ×5
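
As a rough illustration of how headline counts like these could be derived from per-run stat records, here is a minimal Python sketch. The RunRecord class and all of its field names are hypothetical; the actual stat record schema is not shown on this page.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RunRecord:
    # Hypothetical fields, chosen only for illustration.
    model: str         # e.g. a model identifier string
    character: str     # e.g. "IRONCLAD"
    highest_act: int   # 1-based act the run reached before it ended
    victory: bool      # True if the run ended in a win


def summarize(runs: Iterable[RunRecord]) -> dict:
    """Aggregate run records into the four headline counts."""
    runs = list(runs)
    return {
        "runs_published": len(runs),
        "reached_act_3": sum(r.highest_act >= 3 for r in runs),
        "victories": sum(r.victory for r in runs),
        "models_tested": len({r.model for r in runs}),
    }
```

Counting distinct models with a set means the "Models tested" figure depends on the records themselves, not on a separately maintained list.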

What this is, what it isn't

It is

  • One commercial roguelike, played end-to-end, no resets.
  • A fixed JSON-only tool surface (PlayCard, EndTurn, etc.).
  • A frozen prompt and pre-assigned character; no operator coaching, no MemPalace, no sub-agents.
  • Per-run records archived with the original .run save.
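
To make the "fixed JSON-only tool surface" concrete, here is a minimal Python sketch of how a harness might reject any model reply that falls outside that surface. Only the tool names PlayCard and EndTurn come from the list above; the "tool" and "args" keys, the argument names, and the parse_command helper are assumptions for illustration, not the HermesBridge format.

```python
import json

# Hypothetical surface: tool names are from the text, argument names are assumed.
ALLOWED_TOOLS = {
    "PlayCard": {"card_index", "target_index"},
    "EndTurn": set(),
}


def parse_command(raw: str) -> dict:
    """Parse one model reply; raise ValueError if it is outside the fixed surface."""
    try:
        cmd = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(cmd, dict):
        raise ValueError("command must be a JSON object")

    tool = cmd.get("tool")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")

    args = cmd.get("args", {})
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")

    # Reject argument names the tool does not declare.
    extra = set(args) - ALLOWED_TOOLS[tool]
    if extra:
        raise ValueError(f"unexpected arguments for {tool}: {sorted(extra)}")
    return {"tool": tool, "args": args}
```

Under these assumptions, free-text replies and unknown tool names fail in the same way, so a harness built like this could count them as tool errors without interpreting the model's intent.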

It is not

  • A win-rate leaderboard. The trial-v0 set has too few runs (25 in total, five per model) to support one.
  • A measure of pure reasoning. Tool reliability and bridge quirks also affect outcomes.
  • Cheat-resistant against models that have memorized StS1 strategy guides; that's why the game is StS2.