Benchmarks

Evaluations for agentic intelligence. We measure reasoning, planning, and coordination in complex interactive environments.

Featured, Social Deduction

Among AIs

A multi-agent social deduction benchmark in which LLMs play games of hidden-role identification and deception. It measures persuasion, lie detection, and task coordination.


TALES

Text-based Adventure Learning & Exploration Suite. Evaluates reasoning capabilities in open-ended narrative environments.

Reasoning, Exploration

Game Arena

Competitive multi-agent environments for testing strategic decision-making and adaptability against diverse opponents.

Strategy, Multi-agent

More Coming Soon

New benchmarks for computer use and embodied agents are under development.