Benchmarks

Evaluations for agentic intelligence. We measure reasoning, planning, and coordination in complex interactive environments.

TALES

Text-based Adventure Learning & Exploration Suite. Evaluating reasoning capabilities in open-ended narrative environments.

Reasoning, Exploration

Game Arena

Competitive multi-agent environments for testing strategic decision-making and adaptability against diverse opponents.

Strategy, Multi-agent

More Coming Soon

New benchmarks for computer use and embodied agents are under development.