SpireBench

A reproducible LLM benchmark on Slay the Spire 2 played through the HermesBridge mod. Each run is one full game from Neow to death or victory, captured as a line-for-line bridge command log plus a structured stat record. This site publishes the trial-v0 results.

Read the trial-v0 results | Browse runs | Read the protocol

Runs published

Reached Act 3

Victories

Models tested

Models

glm-5.1 ×5
gemini-3.1-pro-preview ×5
gpt-5.5 ×5
deepseek-v4-pro ×5
claude-opus-4.7 ×5

Characters

IRONCLAD ×5
REGENT ×5
SILENT ×5
DEFECT ×5
NECROBINDER ×5

What this is, what it isn't

It is

One commercial roguelike, played end-to-end, no resets.
A fixed JSON-only tool surface (PlayCard, EndTurn, etc.).
A frozen prompt and pre-assigned character; no operator coaching, no MemPalace, no sub-agents.
Per-run records archived with the original .run save.
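
As a rough illustration of what a fixed JSON-only tool surface implies, here is a minimal sketch of a validator that accepts only known commands and rejects free-form text. Only the tool names PlayCard and EndTurn come from this page. The "tool" and "args" keys, the argument names card_index and target_index, and the parse_command helper are hypothetical and are not taken from the HermesBridge protocol.

```python
import json

# Hypothetical tool surface. Only the names PlayCard and EndTurn appear on
# this page; the argument names are assumptions made for illustration.
ALLOWED_TOOLS = {
    "PlayCard": {"card_index", "target_index"},
    "EndTurn": set(),
}


def parse_command(raw: str) -> dict:
    """Parse one model reply and reject anything outside the fixed surface."""
    try:
        command = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(command, dict):
        raise ValueError("command must be a JSON object")

    tool = command.get("tool")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")

    args = command.get("args", {})
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")

    # Reject arguments that the (assumed) schema does not list.
    unexpected = set(args) - ALLOWED_TOOLS[tool]
    if unexpected:
        raise ValueError(f"unexpected arguments for {tool}: {sorted(unexpected)}")
    return {"tool": tool, "args": args}
```

Under these assumptions, parse_command('{"tool": "EndTurn"}') returns a normalized command, while a prose reply such as "I will play Strike" raises ValueError instead of reaching the game.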

It is not

A win-rate leaderboard. trial-v0 has too few runs.
A measure of pure reasoning: tool reliability and bridge quirks also contribute to the outcome.
Cheat-resistant against models that have memorized StS1 strategy guides; the game is StS2 to limit that advantage, not to eliminate it.