Replay: Training Language Models on Games

January 20, 2026 · By Viswajit Nair

Introducing Replay: a suite of game environments, some adapted from existing games and some built from scratch. Games compress crucial skills like planning, resource management, and exploration under uncertainty into learnable environments with clear rewards, and thus are an ideal testbed for teaching models long-horizon reasoning.

We started by running GRPO with a small LLM to see how well RLVR paradigms can use games to improve model capabilities and generalize to other tasks. In Among AIs, we used Among Us to benchmark social reasoning and goal following; Replay broadens the scope to multiple games. With Replay, we're focused on three questions:

  • Skill transfer vs interference: Do different games share a common substrate of skills? Does training across games compound, or does it introduce tradeoffs where being good at one game makes you worse at another?
  • Training efficiency: Can one multi-task agent reach broad competence more efficiently than training a separate expert per game, if you hold environment interactions and model size constant?
  • Optimized skill transfer: Which games create the most transferable skills? Fast arcade games with dense feedback? Slow strategy/puzzle games with sparse rewards? Navigation tasks where risk and uncertainty dominate?

Human intelligence is inherently interactive. It unfolds over time, drawing on experience as we explore an environment, plan, reflect, and adjust toward a goal. Static math and coding benchmarks measure important features like syntactic competence, pattern recognition, and some forms of reasoning, but they don't tell you much about how an agent learns to think over time in an environment. The goal of Replay is to explore this gap. These results are early, but the signal is clear: with the right design, games are a training gym for economically valuable skills.

Three Takeaways So Far

Games teach transferable skills.

We saw GSM8K improvements from training on Flappy Bird and Snake, suggesting that games can teach multiple transferable skills simultaneously, unlike narrower environments that target individual capabilities.

Converting games to LLM environments is non-trivial.

Scaffolding game state into text and shaping rewards for GRPO is more art than science. There's no fixed recipe, and getting it wrong doesn't mean learning degrades; it just means no learning.

GRPO may not be the right algorithm for games.

Credit assignment over hundreds of actions is fundamentally different from math or code verification. Value-function-based approaches might be better suited to this domain.

Replay ranges from simple, text-representable games to richer physics simulations. Today we're open-sourcing a few of them, along with what we've learned from training on them.

The Environments

By comparing different classes of games, we aim to identify which mechanics most reliably produce skills that transfer out of distribution. Replay includes 17 games across different categories, of which we've open-sourced the following 11:

  • Arcade: Flappy Bird, Snake, Pacman
  • Gridworld: Lava Gap, Obstructed Maze, Locked Room (MiniGrid-based)
  • Toy Text: Blackjack, Cliff Walking, Frozen Lake, Taxi
  • Boardgame: Catan

The Gridworld and Toy Text environments were adapted from Farama Gymnasium (previously OpenAI Gym). These environments were designed with traditional RL algorithms in mind: observation and action spaces are numerical and compact, optimized for neural network policies rather than language models. To make them usable, we had to rebuild the interface.

The Game-LLM Interface

Game logic is the easy part. The hard part is presenting state to a language model in a way it can reason about. Most games run continuously: actions happen fast, state updates constantly. To make these work with LLMs, you have to discretize time and translate observations into text without losing the information that matters for good play.

Scaffolding is more art than science. With pure numerical state, the model drowns in coordinates with no semantic anchor; with pure natural language, it loses the precision needed to plan. The sweet spot is a mix of both: spatial relationships described in words, backed by exact positions and distances. Verbose prompts cause problems too. Stuff too much into each turn and the model struggles with context rot, losing track of information from earlier in the episode.
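As a concrete illustration of this hybrid style, here is a minimal sketch of rendering a grid-game state as words backed by exact numbers. The function name, field choices, and phrasing are hypothetical, not the exact format Replay uses:

```python
def render_observation(head, food, body):
    """Render game state as hybrid text: semantic description plus exact numbers.

    `head` and `food` are (x, y) tuples, `body` is a set of occupied cells.
    This is an illustrative sketch of the words-plus-coordinates style, not
    Replay's actual prompt format.
    """
    dx, dy = food[0] - head[0], food[1] - head[1]
    # Describe the spatial relationship in words...
    horiz = f"{abs(dx)} column(s) {'right' if dx > 0 else 'left'}" if dx else "same column"
    vert = f"{abs(dy)} row(s) {'up' if dy > 0 else 'down'}" if dy else "same row"
    # ...while still providing the exact coordinates and distances.
    return (
        f"Your head is at {head}; the apple at {food} is {horiz}, {vert} "
        f"(Manhattan distance {abs(dx) + abs(dy)}). "
        f"Body occupies: {sorted(body)}."
    )
```

The point is that the model gets a semantic anchor ("2 columns left, 2 rows up") and the precision ("Manhattan distance 4") in the same observation.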

Making Games Learnable

The verifiability of games makes them attractive for RL, but verifiable does not mean easy. Many games have sparse rewards over long episodes. Binary rewards work fine for math and code: a verifier condenses the entire solution into one correct/incorrect signal, so credit assignment is clean. In games, however, the model takes hundreds of actions across hundreds of turns. When the only feedback is a win/lose at the end, which of those 300 actions mattered? The tap on turn 12? The left turn on step 47? The gradient signal gets diluted. Games are also path-dependent in messy ways: a bad action on turn 10 might not kill you until turn 50.

With GRPO, the fix is shaping rewards at intermediate steps to guide the model without changing what optimal play looks like. This is delicate. Aggressive shaping can lead to reward hacking, where the model optimizes for partial credit instead of winning. In some ways, GRPO seems unsuitable for games, and value-function-based approaches might be better. Much of our reward shaping looked like manually guiding the agent toward correct moves.

Neither scaffolding nor reward shaping has a fixed recipe. But we think our experience can provide a useful blueprint for others attempting this work. We go deeper on specific examples in the Game Deep Dives below.

Training Setup

We trained Qwen2.5-1.5B-Instruct using GRPO via Prime-RL, with environments hosted on the Prime-RL environments hub. Hardware ranged from 2-6 A100s depending on the game.

Prior to GRPO, we ran SFT on roughly 1000 rollouts of expert data for each game. For some games, these were trajectories from heuristic approaches like A* (for Snake), and for others they were generated by frontier models that can beat the game. SFT serves two purposes: it gives the model a starting policy that can at least play the game, and it locks in output formats for the model. Without this warmup, RL training often fails to get off the ground.
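One way to picture the SFT warmup is converting each expert rollout into (prompt, completion) pairs whose completions already use the tags RL training expects. This is a hypothetical sketch; the function name and record schema are illustrative, though the tag names mirror the formats shown later in the post:

```python
def trajectory_to_sft(observations, actions):
    """Turn one expert rollout into (prompt, completion) pairs for SFT.

    `observations` is a list of per-turn observation strings; `actions` is
    a parallel list of move lists (e.g. from an A* solver for Snake).
    The completion format locks in the output tags the RL phase relies on.
    """
    examples = []
    for obs, act in zip(observations, actions):
        prompt = f"<OBS>\n{obs}\n</OBS>"
        completion = f"<ACTIONS>[{', '.join(act)}]</ACTIONS>"
        examples.append({"prompt": prompt, "completion": completion})
    return examples
```

Training on pairs like these gives the model both a playable starting policy and a fixed output format before any RL updates happen.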

Rollout length varied dramatically across environments. Snake episodes are short: an apple is always within reach, so the reward signal comes quickly. Catan sits at the other extreme, with games running for hundreds of turns before a winner emerges and rewards that are both sparse and delayed. This shaped our training requirements. Short-rollout games trained with modest batch sizes, but long-rollout games needed larger batches to get enough signal per update. Some games simply weren't tractable at our current scale, though we expect scaling along multiple axes (larger models, bigger batches, more compute) to make them viable.

Game Deep Dives

Flappy Bird

This flash-in-the-pan mobile hit was one of the games we scaffolded from scratch, reframing it as a turn-based physics problem.

Observation space

Each tick, the model receives its position, velocity, upcoming pipe locations and gap positions, plus the physics constants:

<FLAPPY id=K>
<OBS>
birdY:3.2
birdX:4.0
velY:-0.6
gapHeight:6.0
birdRadius:0.20
pipes:[(12.5,1.2),(18.2,-0.8)]
score:4
</OBS>
</FLAPPY>

Action space

The model outputs TAP or nothing, wrapped in a reasoning trace:

<THINK>Bird falling, next pipe gap at y=1.2, need to gain height</THINK>
<ACTIONS>[TAP]</ACTIONS>

The physics are explicit: gravity is -0.3 per tick, tap impulse is +1.6. With this information, the model can project N ticks forward and determine whether it clears the upcoming gap. The observation is compact enough that context rot isn't a concern.
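The forward projection the model can do in its reasoning looks like this. The gravity and tap constants come from the text above; the exact update order inside the game engine (velocity first, then position) and the assumption that a tap adds its impulse to velocity are ours:

```python
GRAVITY = -0.3      # per-tick change in vertical velocity (from the post)
TAP_IMPULSE = 1.6   # velocity boost from a tap (from the post; additive is our assumption)

def project(bird_y, vel_y, n_ticks, tap_first=False):
    """Project the bird's height n ticks ahead under the stated physics.

    A sketch of the calculation the model can carry out in its reasoning
    trace to decide whether it will clear an upcoming gap.
    """
    if tap_first:
        vel_y += TAP_IMPULSE
    for _ in range(n_ticks):
        vel_y += GRAVITY   # gravity acts each tick
        bird_y += vel_y    # then position updates by the new velocity
    return bird_y
```

Comparing `project(y, v, n)` with and without `tap_first` tells the model whether tapping now keeps it inside the gap n ticks out.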

Reward structure

The agent receives +1 for each pipe cleared, plus a small survival bonus that accumulates each tick. The problem was that one LLM turn corresponds to one game tick, and there are dozens of ticks before the bird even reaches the first pipe. During this stretch, survival rewards are small and uniform, giving the model no signal about whether it's on a good trajectory. GPT-5.1 could pass multiple pipes by explicitly performing physics calculations in its reasoning, but GPT-4o class models kept hitting bounds before getting anywhere near the first pipe, performing no better than sub-10B models.

For a small model like Qwen2.5-1.5B, training on the raw reward signal was hopeless. The model couldn't survive long enough to ever see a pipe-clearing reward, so there was nothing to learn from.

Our fix was augmenting the survival reward with a time-based multiplier (1.0 + 0.005 × min(step, 80)) that increases as the episode progresses. This creates a gentle curriculum: early survival is rewarded, but sustained play is rewarded more. The shaping tells the agent "you're doing well" as it approaches the first pipe, providing signal before any pipes are cleared.
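The shaped survival bonus can be written directly from the formula above. The multiplier is taken from the text; the base per-tick bonus value is an illustrative assumption:

```python
def survival_reward(step, base=0.01):
    """Per-tick survival bonus with the time-based multiplier from the post.

    The multiplier 1.0 + 0.005 * min(step, 80) grows with the episode and
    caps at step 80; `base` is an assumed per-tick bonus, not Replay's value.
    """
    return base * (1.0 + 0.005 * min(step, 80))
```

Because the multiplier increases with `step`, later ticks are worth more than early ones, which is what turns plain survival into a gentle curriculum toward the first pipe.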

After RL training, the 1.5B Qwen model beat GPT-5.1 by a significant margin—the only environment where our trained model outperformed frontier models.

Snake

[Rollout viewer: GPT-5.1 playing Snake, 17 apples]

Snake is a classic: navigate a grid, eat apples, grow longer, don't collide with yourself or the walls. We scaffolded it on an 8×8 grid and the model receives its head position, body segment positions, apple location, and current HP as an observation.

<APPLE id=3>
<OBS>
size:8
head:(4,5)
body:{(4,5),(4,4),(4,3),(4,2),(4,1)}
food:(2,7)
hp:4
</OBS>
</APPLE>

Unlike Flappy Bird, where one LLM turn equals one game tick, Snake allows the model to output an entire trajectory to the apple in a single turn. The model plans a full path, outputs all the moves at once, and the game executes them sequentially. This serves a dual purpose: it uses context length more efficiently, and it makes credit assignment cleaner, since either the whole trajectory works or it doesn't.

<THINK>
path coordinates: [(3,5), (2,5), (2,6), (2,7)]
moves to reach the apple: [UP, UP, RIGHT, RIGHT]
body squares overlapping with the path: []
...
therefore: path is safe
</THINK>
<ACTIONS>[UP, UP, RIGHT, RIGHT]</ACTIONS>
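The trace above amounts to a collision check over the planned path. A minimal sketch of that check, assuming (row, col) coordinates where UP decreases the row index (consistent with the trace) and conservatively ignoring tail movement:

```python
def validate_path(head, body, food, moves, size=8):
    """Check that a planned move sequence reaches the apple safely.

    Walk the path cell by cell, rejecting wall hits and body overlaps, and
    require the final cell to be the apple. Ignoring the tail vacating cells
    as the snake moves makes this check stricter than necessary.
    """
    deltas = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
    r, c = head
    for move in moves:
        dr, dc = deltas[move]
        r, c = r + dr, c + dc
        if not (0 <= r < size and 0 <= c < size):
            return False  # ran into a wall
        if (r, c) in body:
            return False  # ran into the snake's body
    return (r, c) == food  # path must end on the apple
```

On the observation shown earlier, the path `[UP, UP, RIGHT, RIGHT]` from `(4,5)` to the apple at `(2,7)` passes this check.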

Reward structure

The agent receives +1 for each apple eaten. Episodes are naturally short since an apple is always reachable within a few moves, so reward signal comes quickly compared to games like Flappy Bird. We added an HP system to handle invalid trajectories gracefully: the snake starts with 3 HP, gains 1 HP per apple eaten, and loses 1 HP when it outputs an invalid path. This penalizes guessing without immediately ending the episode, giving the model more attempts to learn from mistakes.
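The HP mechanic described above can be sketched as a small per-turn update. The +1 per apple and the starting/penalty HP values come from the text; bundling them into one function this way is our framing:

```python
def step_reward(ate_apple, valid_path, hp):
    """Update HP and compute reward for one Snake turn.

    +1 reward and +1 HP per apple eaten; -1 HP for an invalid trajectory
    instead of ending the episode immediately. Returns (reward, hp, done).
    """
    reward = 0.0
    if not valid_path:
        hp -= 1            # penalize guessing without ending the episode
    elif ate_apple:
        reward += 1.0      # native reward for eating an apple
        hp += 1            # invalid attempts earned back by good play
    done = hp <= 0         # episode ends only when HP is exhausted
    return reward, hp, done
```

Starting from 3 HP, the model gets several failed attempts to learn from before the episode terminates.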

Results. Training on Snake produced a 3% improvement on GSM8K flexible extract, corroborating similar findings from other work on Snake-like environments. The game requires planning move sequences while tracking how state evolves across multiple steps. This sequential state-tracking and maintaining a mental model of where things will be appears to transfer to multi-step mathematical reasoning.

MiniGrid: Obstructed Maze

[Rollout viewer: Grok-4-fast playing Obstructed Maze]

MiniGrid is a collection of discrete grid-world environments for RL research, featuring goal-oriented tasks like navigating mazes, picking up objects, and opening doors with keys. We worked with three MiniGrid environments: Lava Gap, navigating around a river of lava through a single safe gap; Locked Room, finding a colored key and unlocking a matching door; and Obstructed Maze, the hardest of the three. Here we focus on Obstructed Maze.

MiniGrid posed different challenges from our arcade games. The original environments from Farama give you observations like this:

[[2, 5, 0], [1, 0, 0], [2, 5, 0]]

These compact numerical arrays work for CNN-based policies but are meaningless to language models. We rebuilt the observation space around global coordinates with semantic descriptions:

<OBS id=N>
Grid: 7x7 (origin at bottom-left, X→right, Y→up)
You: (3,2) facing EAST, holding [yellow key]
In front (4,2): yellow door (locked)
Mission: Open the yellow door and reach the goal

Visible objects:
  - yellow key at (2,4)
  - yellow door at (4,2)
  - green goal at (6,3)
</OBS>

MiniGrid's observation space changes every step; one rotation or move shifts every relative position in view. We initially tried one game step per rollout turn, which even frontier models couldn't play well. Our fix was allowing up to 20 actions per turn, letting the model reason in phases: rotate, observe, move, observe, pick up, observe. After this change, frontier models like GPT-4, Claude Opus, and Grok-4 solved even the hardest MiniGrid levels.

Unlike Flappy Bird, MiniGrid has sparse rewards by default: nothing until you reach the goal. This made training noisy, with gradient signal too weak to learn from. Each environment comes with a mission string describing the objective (e.g., "Open the yellow door and reach the goal"), so we introduced milestone rewards tracking progress against this mission. Intermediate steps are rewarded only when they advance the stated goal; picking up the wrong key or opening an irrelevant door gives nothing.

Obstructed Maze

Obstructed Maze requires the longest action sequence: navigate a maze, find a box, open it to reveal a key, use the key to open a door, and pick up a blue ball. Multiple rooms, multiple objects, and a long dependency chain where early mistakes compound.

Shaping rewards are generous given the difficulty: +0.5 for opening the correct box, +0.5 for picking up the correct key, +0.5 for opening the correct door, and +1.0 for picking up the blue ball. Total shaping budget is 2.5, compared to a native reward of up to 1.0 for completion. Without this shaping, we saw no learning.
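The milestone values above can be implemented as once-per-episode bonuses. The bonus amounts are taken from the text; the milestone flag names and the claimed-set bookkeeping are illustrative:

```python
# Milestone bonuses from the post; flag names are illustrative.
MILESTONES = [
    ("opened_box", 0.5),
    ("picked_key", 0.5),
    ("opened_door", 0.5),
    ("picked_ball", 1.0),
]

def shaping_reward(events, claimed):
    """Award each milestone bonus at most once per episode.

    `events` is the set of milestone flags the environment reports for this
    step; `claimed` tracks bonuses already paid out, so repeating a milestone
    (e.g. reopening a door) earns nothing.
    """
    reward = 0.0
    for name, bonus in MILESTONES:
        if name in events and name not in claimed:
            reward += bonus
            claimed.add(name)
    return reward
```

Paying each bonus only once keeps the total shaping budget at 2.5 and avoids rewarding loops that re-trigger a milestone.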

Results

Early signs of generalization are promising. Flappy Bird produced our strongest result: the RL'd 1.5B model beat GPT-5.1 on the game, the only environment where this happened. Both Flappy Bird and Snake showed a 3% improvement on GSM8K flexible extract. Snake requires planning move sequences while tracking how game state evolves (where your body will be when your head arrives), and this kind of sequential state-tracking appears to transfer to multi-step mathematical reasoning.

Other environments trained reliably but we haven't run them to completion. MiniGrid environments needed milestone shaping to learn at all, but with shaping, models solved the puzzles. Stochastic environments like Blackjack were harder to get consistent signals from.

Try It Yourself

All of the game environments mentioned here are open source at antimlabs.com/replay. Each environment is fully scaffolded for LLM training, compatible with Prime-RL, and documented with the prompting formats we used. We're interested in seeing what others find:

  • Results at larger model scales
  • Alternative RL algorithms (value-function methods)
  • Different scaffolding approaches
  • New games that produce good training signals

What's Next

We're continuing to train on Replay environments with more compute, exploring whether multi-task training produces broader generalization. At the same time, we're expanding the suite to larger and more complex games, and building stronger evaluations inspired by Among AIs.

Want to learn more? Reach out at shreyko@antimlabs.com or follow us on X.