
August 14, 2025 · By Shrey Kothari
Games: The Missing Modality for AGI
The curve that felt inexorable is bending. For years, each frontier model arrived with a bigger leap than the last; now the leaps feel smaller. GPT-5, great in its own right, didn't land with the shockwave GPT-4 did. The Bitter Lesson tells us compute is the answer, but what do you spend that compute on once you've exhausted the data?
The last few years in AI were extraordinary. RLHF was the unlock that turned stochastic parrots into helpful, preference-aligned assistants. Companies like Scale AI and Surge AI became the data suppliers fueling that growth. Even if progress were to stop here, there would be immense scope for applications across industries that would greatly improve global quality of life. But we are not at the end of the road: there are proven methods to take us from today's capabilities to AGI.
A growing, almost retrospective consensus is forming: to approach general intelligence, models must go beyond static RLHF and practice in dynamic tasks that mirror real-world decision-making. Hence the shift toward Reinforcement Fine-Tuning (RFT) and pure RL after pre-training. DeepSeek-R1-Zero, one of the first language models post-trained purely with RL (no supervised fine-tuning), improved its AIME-24 benchmark score by roughly 56 percentage points. OpenAI's o3 model, trained to reason through RL, outperformed both GPT-4.1 and GPT-4.5 (released around the same time) on benchmarks as well as in the court of public opinion. RFT is qualitatively different from RLHF: instead of optimizing for human approval, an agent learns by acting and receiving rewards from the environment. Over time, that loop shapes agency, strategy, tool use, and meta-skills that static supervision rarely imparts.
The intuition is simple: if we want long-horizon planning, negotiation and collaboration with other agents, and goal-pursuit under open-ended rules and scarcity, the models need experience and not just labeled data.
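Concretely, the loop looks something like the sketch below: a policy acts, the environment returns a reward, and the policy updates toward whatever earned that reward. This is only a toy illustration, using an invented chain environment, a tabular softmax policy, and a bare REINFORCE update; none of it is any lab's actual recipe.

```python
# Toy act -> reward -> update loop. The chain environment, tabular policy,
# and hyperparameters are illustrative stand-ins, not a real training setup.
import math
import random

N_STATES, N_ACTIONS, GOAL = 6, 2, 5   # walk right along a chain to reach the goal
theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # policy logits

def sample_action(state):
    logits = theta[state]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    r, cum = random.random(), 0.0
    for a, p in enumerate(probs):
        cum += p
        if r <= cum:
            return a, probs
    return len(probs) - 1, probs

def rollout(max_steps=20):
    """Act in the environment; the reward only arrives at the end (sparse)."""
    state, trajectory = 0, []
    for _ in range(max_steps):
        action, probs = sample_action(state)
        trajectory.append((state, action, probs))
        state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
        if state == GOAL:
            return trajectory, 1.0          # success
    return trajectory, 0.0                  # timed out, no reward

LR = 0.5
for episode in range(300):
    traj, reward = rollout()
    for state, action, probs in traj:       # REINFORCE: reinforce actions on rewarded paths
        for a in range(N_ACTIONS):
            grad = (1.0 if a == action else 0.0) - probs[a]
            theta[state][a] += LR * reward * grad

print("learned right-move prob in state 0:", sample_action(0)[1][1])
```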
Games as Interactive Environments
Current models infamously lack abilities, such as theory of mind and coordination, that are crucial for a truly general intelligent system. Reward-optimized games as interactive RL environments are the best classrooms for teaching these cognitive skills; games compress complex patterns into learnable challenges with clear rewards and curricula. Chess and Go, for example, provide well-defined arenas for long-term planning: an RL agent playing them receives a clear win/lose reward at the end of each game, forcing it to plan many moves ahead. Similarly, multi-player cooperative games explicitly test and hone the ability to reason about others' beliefs and intentions. Hanabi is a prominent example in multi-agent RL: success in Hanabi requires modeling teammates' hidden knowledge and communicating through hints. RL in such settings forces an agent to infer what partners know or intend, building a rudimentary theory of mind.
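For a sense of what such an environment exposes to a learner, here is a hedged sketch of a game-as-environment interface: a sparse win/lose reward that only arrives when the game ends, and per-player observations that hide a player's own private state, loosely in the spirit of Hanabi. Every class, method, and rule below is hypothetical.

```python
# Hypothetical game-environment interface with sparse terminal reward and
# partial observability. The "rules" are placeholders, not a real game.
from dataclasses import dataclass, field
import random

@dataclass
class CardGameEnv:
    num_players: int = 2
    hands: list = field(default_factory=list)   # private per-player state
    turn: int = 0
    done: bool = False

    def reset(self):
        self.hands = [[random.randint(1, 5) for _ in range(3)]
                      for _ in range(self.num_players)]
        self.turn, self.done = 0, False
        return self.observe(self.turn)

    def observe(self, player):
        # Partial observability: a player sees everyone's hand except their own.
        return {
            "player": player,
            "visible_hands": [h for i, h in enumerate(self.hands) if i != player],
        }

    def step(self, action):
        # Real game rules would go here; reward stays 0 until the game ends.
        self.done = action == "end_game"
        reward = (1.0 if self._team_won() else -1.0) if self.done else 0.0
        self.turn = (self.turn + 1) % self.num_players
        return self.observe(self.turn), reward, self.done

    def _team_won(self):
        return sum(sum(h) for h in self.hands) >= 18  # placeholder win condition

env = CardGameEnv()
obs = env.reset()
print(env.step("end_game"))   # the terminal step carries the only nonzero reward
```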
Multi-agent games can also teach coordination and emergent tool use. OpenAI's hide-and-seek experiment is striking: teams of agents in a physics world learned, via self-play, to barricade with boxes, counter with ramps, and iterate through six distinct strategy phases, none of which were foreseen by the environment's creators. This kind of co-adaptive environment drives agents to outthink one another, and it hints at competition and cooperation as the building blocks of social intelligence. In business terms, similar simulations could train AI assistants to coordinate across complex projects, with multiple agents managing different parts and learning when to assist or delegate.
Crucially, games aren't just training grounds; they're becoming the benchmarks for "true" model intelligence. Static benchmarks (AIME, LiveBench, etc.) are useful performance snapshots but poor measures of general competence, and they are increasingly being overfit. Games test for a general understanding of the broader world and score performance objectively, with auditable logs and replays and no human intervention. Moreover, their difficulty scales automatically with the capabilities of the systems, since only the best models are pitted against each other.
Recent work is converging on this idea. TextQuests drops agents into 25 classic interactive fiction games to stress-test intrinsic long-horizon reasoning and exploration. TALES (Microsoft) unifies multiple text-adventure frameworks and still finds top models struggling to complete games designed for human enjoyment. Kaggle's Game Arena pits frontier models against each other in Chess and other strategy games to measure real-time planning, adaptation, and competitive reasoning. And then there's Among AIs, a benchmark created by Antim Labs to measure negotiation, cooperation, and deception. Among AIs runs as a live arena where frontier agents solve tasks, form coalitions, and vote out an impostor—generating rich trajectories of persuasion, trust, and betrayal. Together, these game-based evals underscore how far agents remain from robust sequential reasoning under partial observability and delayed reward.
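In skeleton form, an arena like these is just a harness that pits agents against each other, scores the outcome, and writes an auditable replay. The sketch below uses random stand-ins for model-backed players and a trivial best-of-N game; it illustrates only the shape of the loop, not how any of the benchmarks above are actually implemented.

```python
# Toy head-to-head eval harness with an auditable replay log. Agents are
# random stand-ins; a real harness would call frontier models instead.
import json
import random

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def random_agent(history):
    """Stand-in for a model call; a real arena would prompt an LLM here."""
    return random.choice(MOVES)

def run_match(agent_a, agent_b, rounds=7, replay_path="replay.json"):
    replay, score = [], {"a": 0, "b": 0}
    for turn in range(rounds):
        move_a, move_b = agent_a(replay), agent_b(replay)
        if BEATS[move_a] == move_b:
            score["a"] += 1
        elif BEATS[move_b] == move_a:
            score["b"] += 1
        replay.append({"turn": turn, "a": move_a, "b": move_b, "score": dict(score)})
    with open(replay_path, "w") as f:
        json.dump({"score": score, "turns": replay}, f, indent=2)  # auditable log
    return score

print(run_match(random_agent, random_agent))
```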
Through game-based RL training and evals, models acquire and demonstrate capabilities like planning, self-correction over long tasks, perspective-taking, and teamwork, none of which static corpora or instruction tuning teach. Simulated environments offer a safe sandbox to impart these lessons and a rigorous scoreboard to tell whether we're actually getting smarter.
Human Data and Self-Play Environments
Pure self-play RL can yield superhuman strategies, but started tabula rasa it risks "alien" behaviors or reward hacking, where an AI finds loopholes that maximize reward in unintended ways (e.g., modifying tests to "pass" rather than improving the solution). Human examples define a reference path the AI can mimic initially, making subsequent RL far less likely to veer into bizarre solutions. Introducing human gameplay data is critical to align behavior with human norms, bootstrap complex skills, and jumpstart the policy before RL optimization.
In 2019, DeepMind trained an AI agent called AlphaStar to play StarCraft II. Initially training the model through imitation learning on human game data let it learn the basic micro- and macro-strategies used by professional StarCraft players; the resulting agent defeated the built-in "Elite" level AI in 95% of games. That policy was then used to seed a multi-agent RL process, which allowed AlphaStar to decisively beat Team Liquid's Grzegorz "MaNa" Komincz, one of the world's strongest professional StarCraft players, 5-0. Another landmark multi-agent RL experiment is Meta's CICERO, an agent created to play Diplomacy. CICERO combined a language model with strategic planning RL, trained first on human gameplay data (roughly 13 million human dialogue messages) to learn human-style interaction, then fine-tuned with self-play RL to plan winning moves. The resulting agent was adept at natural conversation and reached top-10% performance in human Diplomacy leagues. Without human grounding, a pure RL agent would have employed undesirable strategies like gibberish communication and pathological betrayal, rendering it unusable in human games.
In general, human trajectories prevent agents from getting stuck in odd corners of the state space and mitigate reward hacking. Humans in the loop set the alignment anchor before the model is allowed to optimize on its own. Commercially, jumpstarting the policy with human gameplay data cuts training time and compute cost; starting near human level means RL doesn't waste millions of steps rediscovering basics.
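The two-stage recipe, clone human behavior first and only then optimize with RL, can be made concrete with the same kind of toy chain setup as the earlier sketch. The demonstrations, environment, and hyperparameters below are invented for illustration; they only show why a human-seeded policy finds sparse rewards faster than one started from scratch.

```python
# Toy two-stage pipeline: behavior cloning on "human" trajectories, then RL
# fine-tuning from that initialization. Everything here is illustrative.
import math
import random

N_STATES, N_ACTIONS, GOAL, LR = 6, 2, 5, 0.5
theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def probs(state):
    m = max(theta[state])
    exps = [math.exp(l - m) for l in theta[state]]
    z = sum(exps)
    return [e / z for e in exps]

def update(state, action, weight):
    p = probs(state)
    for a in range(N_ACTIONS):
        theta[state][a] += LR * weight * ((1.0 if a == action else 0.0) - p[a])

# Stage 1: behavior cloning on (state, action) pairs from human play.
human_demos = [(s, 1) for s in range(GOAL)]          # "humans" always move right
for _ in range(50):
    for state, action in human_demos:
        update(state, action, 1.0)                   # cross-entropy gradient step

# Stage 2: RL fine-tuning; the cloned policy already acts sensibly,
# so the sparse reward is found quickly instead of by blind exploration.
for _ in range(100):
    state, traj = 0, []
    for _ in range(20):
        action = 1 if random.random() < probs(state)[1] else 0
        traj.append((state, action))
        state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
        if state == GOAL:
            break
    reward = 1.0 if state == GOAL else 0.0
    for s, a in traj:
        update(s, a, reward)                         # REINFORCE on the outcome

print("P(move right | start):", round(probs(0)[1], 3))
```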
Antim Labs: Playgrounds for AGI
Traditional data-labeling platforms aren't built for this new regime. They excel at powering RLHF and static feedback pipelines: annotating text, ranking answers, labeling images. But running rich interactive environments—where agents act sequentially, coordinate with others, and are judged by long-horizon outcomes—is a different domain. You need game design, multi-agent orchestration, real-time systems, and RL engineering under one roof. That is the problem space Antim Labs is focused on.
Our work at Antim Labs rests on two pillars that close the gap between current intelligent systems and human-level intelligence. First, we create configurable game environments optimized for signal density and skill diversity, iterating on game design and model training together so the games themselves get better at improving models. Second, we source diverse human gameplay data through our AI-powered entertainment and gaming platform, 4Wall AI, where over a hundred thousand users create characters, build worlds, and act out stories with their friends and bots.
Under the hood, our team has developed an AI-native world engine: a multi-agent framework that orchestrates many humans and agents over shared state in real time. This framework is used to power Spot, a personalized virtual world on 4Wall AI. The same engine lets us spin up new simulations quickly by hot-swapping rules, maps, tasks, and reward functions. Projects like Replay, our suite of open-source game environments, and Among AIs all run on this substrate.
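To make "hot-swapping rules, maps, tasks, and reward functions" concrete, here is a purely hypothetical configuration sketch. None of the names below reflect our actual engine; they only illustrate the idea of composing a new simulation from swappable parts rather than rebuilding it.

```python
# Hypothetical world-engine config: a simulation is just a bundle of swappable
# parts, so a variant needs a new config, not a new codebase.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class WorldConfig:
    map_name: str
    rules: Dict[str, int]                       # e.g. team sizes, turn limits
    tasks: list
    reward_fn: Callable[[dict], float]          # scores a finished episode

def coop_reward(episode: dict) -> float:
    return 1.0 if episode.get("tasks_done", 0) >= episode.get("tasks_total", 1) else 0.0

def deception_reward(episode: dict) -> float:
    return 1.0 if episode.get("impostor_undetected", False) else -1.0

REWARDS = {"coop": coop_reward, "deception": deception_reward}

def spin_up(base: WorldConfig, **overrides) -> WorldConfig:
    """Clone a config with some parts swapped out instead of rebuilding the sim."""
    fields = {**base.__dict__, **overrides}
    return WorldConfig(**fields)

social_game = WorldConfig(
    map_name="spaceship",
    rules={"players": 8, "impostors": 1, "turn_limit": 40},
    tasks=["wiring", "reactor", "navigation"],
    reward_fn=REWARDS["coop"],
)
# Swap only the reward function to study deception instead of cooperation.
impostor_variant = spin_up(social_game, reward_fn=REWARDS["deception"])
print(impostor_variant.reward_fn({"impostor_undetected": True}))
```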
Reinforcement fine-tuning of language models is still nascent, and there's an industry-wide scarcity of rigorous interactive RL environments and evaluation rubrics. We provide the training substrate frontier labs can't build alone: a world engine that spins up environments fast, plus proprietary interactive data from our creator-powered distribution flywheel. Labs bring the policy hooks; we bring the worlds, the players, and the RL data factory.
Want to learn more? Reach out at shreyko@antimlabs.com or @shreyko