Embodied Agents in Simulated Worlds

We trained a small VLM-based agent to play Elden Ring in real time from pixels alone. Curious how it actually plays? Read on — you’ll see it in action. The goal was not to complete the game or claim that our setup used the best model or system architecture for the task. Rather, it was to test whether interactive 3D environments can produce useful training data for language-steerable embodied agents. The early result: even a short training run produced basic navigation, combat, and environment interaction. It also surfaced key issues around memory, navigation, and self-correction.

This experiment uses only a tiny slice of publicly available data. Orders of magnitude more data in the same format is easily accessible to anyone, and increasing the amount and diversity of training data is a natural path toward more capable general agents of this type.

We’d like to acknowledge ByteDance’s Lumine¹ and NVIDIA’s NitroGen.² This research project builds on ideas from both works, combining elements of their experimental setups.

Why video games?

Multimodal models can describe a frame, answer questions about a video, and reason about a scene. However, embodied agents require a different skill set. They need to observe the world, maintain state, choose an action, execute it, and then continue operating from the state created by that action. When the world itself is evolving rapidly, latency becomes a critical consideration.

Games serve as a convenient testbed because they already contain rich world state, diverse tasks, and large amounts of publicly available demonstration data. The longer-term opportunity is to generate controllable environments directly: varied layouts, task distributions, and evaluation scenarios, which can serve as a scalable pretraining substrate for more general embodied intelligence. At Antim Labs, we are already building toward this with Gizmo, our world generation platform.

Building the dataset

NVIDIA’s NitroGen showed that internet gameplay videos with visible, synchronized gamepad overlays can be converted into large-scale video-action datasets, and that this data can be used to train video game agents. However, NitroGen uses a different model architecture, one that is not steerable with natural language instructions.

We obtain the frame-action data needed for training from YouTube gameplay videos with synced controller input overlays. We extract the gameplay frames and corresponding gamepad states, then mask out the visible controller overlay and other streaming artifacts in the model input.

We segment the data into 20-second chunks, then filter out chunks where more than 60% of actions are idle, meaning there is no meaningful button or joystick movement. This removes footage where the streamer is talking to chat, sitting through cutscenes, or otherwise producing little useful control signal. In total, the filtering step removes approximately 15% of the frames.
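
The filtering rule can be sketched as follows. This is a minimal illustration, not the pipeline we ran: the `GamepadState` layout, the helper names, and the 0.1 dead-zone used to decide what counts as "meaningful" joystick movement are all assumptions made for the example. Only the 20-second chunk length and the 60% idle threshold come from the text.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

CHUNK_SECONDS = 20         # from the text
MAX_IDLE_FRACTION = 0.60   # from the text
DEAD_ZONE = 0.1            # assumption: smaller stick deflections count as idle


@dataclass(frozen=True)
class GamepadState:
    """Hypothetical per-frame gamepad sample extracted from the overlay."""
    sticks: Tuple[float, float, float, float] = (0.0, 0.0, 0.0, 0.0)
    buttons: Tuple[str, ...] = ()


def is_idle(state: GamepadState, dead_zone: float = DEAD_ZONE) -> bool:
    """Idle means no button is pressed and both sticks sit near centre."""
    return not state.buttons and all(abs(v) < dead_zone for v in state.sticks)


def split_into_chunks(
    states: Sequence[GamepadState], rate_hz: float, chunk_seconds: int = CHUNK_SECONDS
) -> List[Sequence[GamepadState]]:
    """Cut a recording into consecutive fixed-length chunks (the last may be short)."""
    size = int(rate_hz * chunk_seconds)
    return [states[i:i + size] for i in range(0, len(states), size)]


def keep_chunk(
    chunk: Sequence[GamepadState], max_idle_fraction: float = MAX_IDLE_FRACTION
) -> bool:
    """Keep a chunk unless more than `max_idle_fraction` of its actions are idle."""
    if not chunk:
        return False
    idle = sum(is_idle(s) for s in chunk)
    return idle / len(chunk) <= max_idle_fraction
```

A recording would then be reduced with something like `[c for c in split_into_chunks(states, rate_hz) if keep_chunk(c)]`.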

After filtering, we obtain a dataset of 460 hours of Elden Ring gameplay, converted into 8.2M frame-action samples after 5 Hz sampling for continued pretraining. Web-scale multimodal data is mixed in to prevent catastrophic forgetting.

We also chunk 70 hours of this data into 5-second clips and retrospectively annotate them with instructions for instruction fine-tuning (IFT). The annotation is done using Gemini 2.5 Flash.

Why not a VLA?

In robotics VLAs, one reason for using action heads is that robot actions are low-level continuous values: end-effector deltas, joint targets, gripper commands, etc. Those values do not naturally correspond to anything a pretrained VLM has learned from internet-scale data, and make little “sense” to it.

Games are different. Gamepad and keyboard/mouse symbols, and their meanings in-game, are things that a VLM can somewhat understand due to large-scale pre-training. “Press E,” “attack,” “WASD,” etc. are concepts that appear in gameplay UI elements, offline tutorials, and forums. The model still has to learn the exact control dynamics of the game, but the output space is more language-aligned than robot joint values.

There is another reason to choose this action space over an action vector for a video game agent: steerability.

In VLAs, the decoded action vector can become dominated by visual state conditioning, weakening the influence of the instruction signal. The model learns to focus on one modality over others because it leads to acceptable accuracy. For most game frames, the next action is predictable from what is on screen, and the instruction can become a comparatively weak conditioning signal.

This is a problem when you want precise control. Prompts like “stop walking and stand still” or “stop attacking, step back to recover” have to override the local action prior. If the action head has mostly learned to continue whatever is visually and temporally likely, the instruction-conditioning may not be strong enough to break that momentum. Guidance-style methods can help in some settings, but they are not a clean solution for real-time control.

A VLM using text tokens as executable actions after post-processing gives a more direct steering path, without the added complexity of steering a VLA. The instruction, visual context, history, and action sequence are all handled through the same autoregressive interface. The model is not only producing a hidden continuous trajectory; it is explicitly generating the control tokens under the prompt.

Model choice

For this experiment, we used Qwen-3.5-2B-Base as the base model.

Model choice is strongly influenced by latency considerations. In our experiments, output quality on individual frame inputs improved with model size, as one would expect, but the larger models performed worse in actual closed-loop gameplay testing because they were too slow.

This pushes the practical design toward smaller VLMs. Models of size 2B-9B with a good action representation and optimized inference can be more useful than a much larger model that cannot act quickly enough.

The same constraint applies to reasoning. Reasoning is valuable, but it cannot be paid for at every frame. Most frames require continuation, especially navigation, which makes up the bulk of gameplay frames. Reasoning is most useful at decision boundaries: a failed plan, a fork in the path, etc.

The model should therefore output short reasoning token sequences in a temporally sparse manner and only when required. Plain action should be the default. Every decoded token adds latency, and in an interactive environment extra decode time is costly.

Action representation

We output actions at a rate of 5 Hz with a chunk size of 6. Concretely, this means we send and receive model outputs every 200 ms. Each output consists of the X and Y displacements of the left and right joysticks, followed by 6 chunks that each correspond to a roughly 33 ms sub-window. It looks like this:

\[ j^L_x \; j^L_y \; j^R_x \; j^R_y \,;\, k_1 \,;\, k_2 \,;\, \ldots \,;\, k_6 \]

This captures the following structure: movement direction, camera pan, button states, and trigger values emitted over a short future window. The joystick values lie in [-1, 1] and are discretized into bins of width 0.05 to make learning easier. The output is streamed and executed continuously as it arrives, enabling smoother gameplay.
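
To make the representation concrete, here is a minimal sketch of quantizing joystick values and turning one output into text and back. The concrete token spelling is an assumption for illustration: integer bin indices for the four axes, space-separated button names inside each chunk, and "-" for an empty chunk are hypothetical choices, as are the function names. Only the [-1, 1] range, the 0.05 bin width, and the six chunks come from the text.

```python
from typing import List, Sequence, Tuple

BIN_WIDTH = 0.05       # from the text
CHUNKS_PER_STEP = 6    # from the text


def quantize_axis(value: float, bin_width: float = BIN_WIDTH) -> int:
    """Clamp to [-1, 1] and return the nearest bin index (-20..20 for width 0.05)."""
    clamped = max(-1.0, min(1.0, value))
    return round(clamped / bin_width)


def dequantize_axis(index: int, bin_width: float = BIN_WIDTH) -> float:
    """Map a bin index back to the joystick value at the centre of that bin."""
    return index * bin_width


def encode_action(sticks: Sequence[float], chunks: Sequence[Sequence[str]]) -> str:
    """Serialize four stick axes plus six button chunks (hypothetical text format)."""
    if len(sticks) != 4 or len(chunks) != CHUNKS_PER_STEP:
        raise ValueError("expected 4 stick axes and 6 chunks")
    stick_part = " ".join(str(quantize_axis(v)) for v in sticks)
    chunk_parts = [" ".join(c) if c else "-" for c in chunks]
    return " ; ".join([stick_part] + chunk_parts)


def decode_action(text: str) -> Tuple[Tuple[float, ...], List[List[str]]]:
    """Parse the text form back into stick values and per-chunk button lists."""
    parts = [p.strip() for p in text.split(";")]
    if len(parts) != 1 + CHUNKS_PER_STEP:
        raise ValueError("malformed action string")
    sticks = tuple(dequantize_axis(int(tok)) for tok in parts[0].split())
    if len(sticks) != 4:
        raise ValueError("expected 4 stick axes")
    chunks = [[] if p == "-" else p.split() for p in parts[1:]]
    return sticks, chunks
```

A strict parser like `decode_action` matters in the closed loop: a malformed generation should be rejected rather than sent to the game as an arbitrary input.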

In practice, this means that the model observes at a lower frequency (once every 200 ms, or 5 Hz) and acts at a higher control frequency (once every ~33 ms, or roughly 30 Hz). This lets a VLM-style model operate inside a live environment without requiring impossibly fast full-frame inference.
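
The split between observation rate and control rate can be sketched as a small scheduler. This is an illustrative simplification under stated assumptions: the function names are hypothetical, `apply` stands in for whatever sends input to the game, and the sketch executes a fully decoded output rather than streaming chunks as they are generated.

```python
import time
from typing import Callable, List, Sequence, Tuple

OBSERVE_PERIOD_S = 0.200                                  # one model output every 200 ms
CHUNKS_PER_STEP = 6                                       # chunks per model output
CONTROL_PERIOD_S = OBSERVE_PERIOD_S / CHUNKS_PER_STEP     # about 33 ms per chunk


def schedule(step_start_s: float, chunks: Sequence[str]) -> List[Tuple[float, str]]:
    """Assign each chunk the time at which it should be sent to the game."""
    return [(step_start_s + i * CONTROL_PERIOD_S, c) for i, c in enumerate(chunks)]


def execute_step(
    chunks: Sequence[str],
    apply: Callable[[str], None],
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Play one model output: apply each chunk at its due time, in order."""
    for due, chunk in schedule(now(), chunks):
        delay = due - now()
        if delay > 0:
            sleep(delay)
        apply(chunk)
```

Passing `now` and `sleep` as parameters keeps the timing logic testable with a fake clock, which is useful when tuning the loop offline.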

Training

Continued Pretraining (CPT)

During CPT, the model is learning the motor grammar of the environment. Intuitively, it is helpful to understand the pretraining stage as teaching the model basic instincts of Elden Ring gameplay: given a game frame, what action is a human most likely to perform?

A critical concept to understand with respect to pretraining is agent inertia.

When trained on frame-action pairs from human gameplay, the model ends up biased toward local continuity of play. If the character is already moving forward, the next action is often to keep moving forward. If the camera is already rotating, the next action is often continued camera rotation. If the player is standing still, the next action is often also stillness.

This has its uses, of course. Real control is temporally coherent. Actions are not independent samples. Movement, camera adjustment, combat, and menu navigation all have short-term persistence, and the model needs to learn that persistence to produce smooth behavior.

However, the pretrained model becomes “too good” at continuing the current behavioral mode. Moving agents keep moving, and stationary agents stay stationary. The policy learns the local texture of gameplay without always learning when to interrupt it.

Instruction Fine-Tuning (IFT) & Reasoning

Instruction tuning gives the model an external control signal strong enough to break local behavioral inertia. Instead of being driven solely by inertia, the model learns to adapt its actions to goals given as instructions.

In the IFT stage, the model receives both the current frame and its immediate goal as part of its context. With reasoning, the model’s reasoning for what it should do next is fed back into context as the instruction for future steps.
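
The feedback of reasoning into the instruction slot can be sketched as below. The output format here is purely hypothetical: the `<reason>...</reason>` wrapper and the function names are assumptions made for illustration, and the text does not specify how reasoning is delimited. The sketch only shows the control flow: when the model emits reasoning, that reasoning becomes the instruction for later steps; otherwise the current instruction is carried forward.

```python
import re
from typing import Optional, Tuple

# Hypothetical delimiter for an optional reasoning prefix in the model output.
_REASONING = re.compile(r"^\s*<reason>(.*?)</reason>\s*", re.DOTALL)


def split_output(text: str) -> Tuple[Optional[str], str]:
    """Separate an optional reasoning prefix from the action tokens that follow."""
    match = _REASONING.match(text)
    if match is None:
        return None, text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def next_instruction(current: str, model_output: str) -> Tuple[str, str]:
    """Return (instruction for future steps, action string for this step)."""
    reasoning, action = split_output(model_output)
    return (reasoning or current), action
```

Because most steps carry no reasoning, the common path returns the unchanged instruction and adds no decode overhead beyond the action tokens.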

Real-time inference

The inference stack is as important as the training stack. A real-time agent has a latency budget that spans frame capture, preprocessing, vision encoding, multimodal projection, prefill, token decoding, network latency, and input execution. Longer histories increase prefill, which is very expensive, and more reasoning increases decoding time. Higher frame resolutions improve perception while slowing the loop, though in our observations the returns diminish above 720p.
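
One simple way to reason about this budget is to sum per-stage latencies and compare the total against the loop period. The sketch below is illustrative only: the stage names mirror the list above, treating one 200 ms observation period as the budget is an assumption, and no measured numbers are implied. Callers would supply their own profiled values.

```python
from typing import Dict, Tuple

BUDGET_MS = 200.0  # assumption: one observation period at 5 Hz

STAGES = (
    "frame_capture",
    "preprocessing",
    "vision_encoding",
    "multimodal_projection",
    "prefill",
    "token_decoding",
    "network",
    "input_execution",
)


def check_budget(
    stage_ms: Dict[str, float], budget_ms: float = BUDGET_MS
) -> Tuple[float, float, str]:
    """Return (total latency, remaining headroom, slowest stage) in milliseconds."""
    missing = [s for s in STAGES if s not in stage_ms]
    if missing:
        raise KeyError(f"missing stage timings: {missing}")
    total = sum(stage_ms[s] for s in STAGES)
    slowest = max(STAGES, key=lambda s: stage_ms[s])
    return total, budget_ms - total, slowest
```

Negative headroom means the loop cannot keep pace, and the slowest stage is the first candidate for optimization.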

All of these choices interact. Model size, image resolution, action vocabulary, caching, quantization, and runtime cannot be optimized independently. 2B and 4B models comfortably produce outputs within the latency constraints when served with inference engines like vLLM. For 7B and 9B models, more sophisticated approaches are required to hit latency goals, especially if the server is in the cloud.

Playthrough demo

Next-action accuracy is useful, but it does not determine whether the agent survives its own actions inside the game world. The agent has to actually play the game for us to get a better sense of its capabilities.

What we observed

We saw that the agent learned the following rather quickly during the pretraining phase:

- Clicking through simple UI interactions like selecting yes/no, opening/closing settings windows, and skipping cutscenes
- Basic combat
- Basic game mechanics like resting at a Site of Grace and interacting with game elements

An interesting observation was that while navigation did get better as pretraining progressed, even after training on 460 hours of gameplay data, it was still rather unreliable. The agent would still sometimes walk in zig-zag patterns, fall off cliffs and bridges, and try to walk through solid objects. We expect additional and more diverse pretraining data to solve this.

Another issue was that the model barely saw any negative examples in its pretraining data. A human player would rarely walk into a wall and then correct themselves, as navigation is trivial for a human. As a result, the model was not good at self-correcting. This issue can likely be addressed by manually collecting representative negative examples. This is one reason controllable simulation is valuable: it gives us an easy way to generate degraded states that are otherwise under-represented in human demonstrations.

In a similar vein, the model’s in-game instruction-following abilities improved with the amount and diversity of instruction-following clips used in the IFT stage. We expect performance to further improve with more data.

Memory

Short-Term Context

A single frame is often enough to identify the scene, yet insufficient for control in many cases. The correct action often depends on current motion and prior states. In some early experiments, we observed that adding a short visual history and action history substantially mitigates some of the issues described above. The model needs to know not only what it sees now, but also what just happened. Without that context, it tends to collapse into shallow reactivity.
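
A short rolling history of this kind can be sketched with a bounded buffer. This is a minimal illustration under assumptions: the class name, the default window of 4 steps, and the flat `(kind, value)` context layout are hypothetical, and frames are represented by identifiers rather than image tensors.

```python
from collections import deque
from typing import Deque, List, Tuple


class ShortHistory:
    """Keeps the most recent (frame, action) pairs and drops older ones."""

    def __init__(self, max_steps: int = 4) -> None:  # window size is an assumption
        self._steps: Deque[Tuple[str, str]] = deque(maxlen=max_steps)

    def push(self, frame_id: str, action: str) -> None:
        """Record the frame the model saw and the action it produced."""
        self._steps.append((frame_id, action))

    def context(self, current_frame_id: str, instruction: str) -> List[Tuple[str, str]]:
        """Build the model context: instruction, past frame/action pairs, current frame."""
        items: List[Tuple[str, str]] = [("instruction", instruction)]
        for frame_id, action in self._steps:
            items.append(("frame", frame_id))
            items.append(("action", action))
        items.append(("frame", current_frame_id))
        return items
```

The window size is a direct latency trade-off, since every extra past frame adds to prefill.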

Long-Term Memory

The more difficult question is long-term memory. Short history helps with combat timing, camera control, and local navigation, but does not really help with objectives that unfold over hundreds or thousands of frames.

This remains one of the weakest parts of the stack, and requires substantial work. A high-performing agent would need some combination of recurrent state, compressed summaries, retrieval, and goal-oriented progress tracking.

What comes next

This was only an initial training run, but the implications are exciting. It reinforces a useful direction: train from human demonstrations while preserving language-steerability, close the loop in real time, and expose the agent to diverse worlds, tasks, and failure modes. Existing games are a practical starting point, but the larger opportunity is controllable world generation: environments where tasks, failures, and evaluation settings can be generated deliberately. If persistent interactive 3D worlds can be generated and trained on at scale, they may become one of the most useful substrates for building more capable embodied agents.

FOOTNOTES