Scaling Behavior Cloning Improves Causal Reasoning: An Open Model for Real-Time Video Game Playing

Intermediate
Yuguang Yue, Irakli Salia, Samuel Hunt et al. · 1/8/2026
arXiv · PDF

Key Summary

  • The paper teaches a game-playing AI to copy good human players (behavior cloning) and shows that simply scaling up the model and the data makes the AI reason more causally (it pays attention to what truly causes outcomes on screen).
  • They release an open recipe: 8,300+ hours of expert gameplay, full training and inference code, and model checkpoints that run at 20 frames per second on a consumer GPU.
  • The model, called Pixels2Play (P2P), looks at video frames and optional text instructions and outputs keyboard and mouse actions in real time.
  • A clever action decoder lets the big model emit only one action token per frame, then a tiny helper expands it into full mouse+keyboard moves; this speeds up inference about 5×.
  • They carefully fix the training–inference gap (differences between recorded training frames and live frames) with matching resize code, RGB choices, and rich data augmentation, which greatly improves live play.
  • In two simple Godot games and in real titles like DOOM, Quake, and Roblox, bigger models perform better and look more human-like in blind human evaluations.
  • A toy study and a large-scale test both show that deeper, larger models trained on more diverse data rely more on visual causes and less on “just copying previous actions.”
  • They also try using unlabeled videos via an inverse dynamics model to guess actions; this lowers test loss but doesn’t clearly win in human preference yet.
  • Instruction following works: short text hints (like “press the red button”) significantly raise success in a Quake maze.
  • Bottom line: scaling behavior cloning with the right architecture and data hygiene is a practical path to better, more causal, real-time game agents.

Why This Research Matters

Real-time, general game agents can help everyday players with coaching, accessibility, and smoother camera control across many titles on consumer hardware. Better causal reasoning means fewer frustrating moments—less jitter, fewer loops, and actions that respond to what’s truly on screen. An open recipe lowers the barrier for students, hobbyists, and researchers to build, test, and improve such agents. Insights about scaling and causality transfer beyond games to robots, simulations, and interactive learning tools. Instruction following hints at goal-driven assistants that can adapt to user prompts in dynamic worlds. Finally, tightening the training–inference match is a blueprint for making many vision-action systems work robustly outside the lab.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): You know how you can get good at a game by watching a skilled friend and then trying to do what they do?

🥬 Filling (The Actual Concept): What it is: Behavior cloning is teaching a computer to play by copying examples of state→action pairs from humans. How it works: (1) Record the screen frames and the player’s keyboard/mouse; (2) Train a model to predict the next action from recent frames; (3) Deploy it to play in real time. Why it matters: Without copying strong demonstrations, the AI would need complicated reward signals and long training runs inside the game.

🍞 Bottom Bread (Anchor): Imagine recording 8,300+ hours of pro gameplay and then training an AI to press WASD and move the mouse like they did when it sees similar frames.
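
To make that loop concrete, here is a minimal behavior-cloning sketch in PyTorch. It is not the paper's code: the tiny convolutional policy, the 64-way action vocabulary, and the frame-stack size are all toy assumptions; the point is just the supervised state→action loop described above.

```python
# Minimal behavior-cloning sketch (toy assumptions, not the paper's architecture):
# the policy maps a short stack of frames to logits over a discretized action
# vocabulary and is trained with cross-entropy against the human's recorded action.
import torch
import torch.nn as nn

NUM_ACTIONS = 64            # hypothetical size of a discretized keyboard/mouse vocabulary
FRAME_STACK, H, W = 4, 96, 96

class TinyPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 * FRAME_STACK, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened feature size once
            feat = self.encoder(torch.zeros(1, 3 * FRAME_STACK, H, W)).shape[1]
        self.head = nn.Linear(feat, NUM_ACTIONS)

    def forward(self, frames):                  # frames: (B, 3*FRAME_STACK, H, W)
        return self.head(self.encoder(frames))  # logits over the action vocabulary

policy = TinyPolicy()
opt = torch.optim.AdamW(policy.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# One gradient step on a fake batch of recorded (frames, action) pairs.
frames = torch.randn(8, 3 * FRAME_STACK, H, W)
actions = torch.randint(0, NUM_ACTIONS, (8,))
loss = loss_fn(policy(frames), actions)
opt.zero_grad(); loss.backward(); opt.step()
print(f"behavior-cloning loss: {loss.item():.3f}")
```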

🍞 Top Bread (Hook): Think about learning to bike indoors on a carpet and then wobbling outside on gravel; the feel is different.

🥬 Filling: What it is: Distributional shift is when the situations at test time are different from what was in training. How it works: (1) Model learns from a dataset; (2) In the real game, it sees new camera motions, textures, lighting, or gets stuck; (3) Its predictions degrade. Why it matters: Without handling shift, the agent may freeze, repeat loops, or crash into walls.

🍞 Bottom Bread: The team lets the model play and a human “corrects course” when it gets stuck, adding these correction snippets back into training.

🍞 Top Bread (Hook): If you always brake when you see brake lights, you’re mixing up cause and effect.

🥬 Filling: What it is: Causal confusion is when the AI latches onto clues that correlate with actions but don’t cause them. How it works: (1) The dataset has shortcuts (e.g., brake lights often show when braking); (2) The model learns the shortcut; (3) In new scenes, the shortcut fails. Why it matters: The agent behaves poorly when the non-causal clue appears without the true cause.

🍞 Bottom Bread: In driving-like frames, the model might brake when it “sees” a past-frame brake light instead of an actual obstacle ahead.

🍞 Top Bread (Hook): Picture a super coach who sets special goals for you like “reach the red door,” not just “turn left.”

🥬 Filling: What it is: Text conditioning means giving the model short instructions along with images. How it works: (1) A text encoder turns a sentence into an embedding; (2) The policy attends to images and text together; (3) The output actions follow the goal. Why it matters: Without text, the agent might not emphasize key objectives in ambiguous scenes.

🍞 Bottom Bread: In a Quake maze, “press the red button” increased success versus no text.

🍞 Top Bread (Hook): Imagine trying to play every PC game with one controller layout that handles all keys and mouse moves.

🥬 Filling: What it is: A general, real-time, multi-game policy is a single model that must speak the full PC action “language.” How it works: (1) The model reads frames; (2) It predicts a compact action token; (3) A small decoder expands it to keyboard/mouse details; (4) It runs at 20 Hz on a consumer GPU. Why it matters: Without a compact action design, real-time play would be too slow.

🍞 Bottom Bread: Instead of predicting dozens of key/mouse tokens per frame, the big model emits one token and a tiny helper unpacks it quickly.

Before this work, many top game AIs used reinforcement learning tailored to a single game with handcrafted rewards. That was powerful but not general—and heavy to run. Behavior cloning is simpler and game-agnostic but suffers from distributional shift and causal confusion. Past attempts tried freezing large vision-language models or simplifying action spaces to one-hot choices; those trade realism and speed for convenience. What was missing was an open, fast, multi-game recipe that shows, with evidence, how to reduce causal confusion as models and data scale. Why this matters in daily life: smoother camera movement, fewer dizzy jitters, and an agent that actually presses the right button when you ask—on your own PC, in real time, and across many games.

02 Core Idea

🍞 Top Bread (Hook): Imagine getting better at piano simply by listening to more great recordings and practicing with a smarter, faster metronome.

🥬 Filling (The Actual Concept): What it is: The key insight is that scaling behavior cloning—bigger models, more diverse data—plus a smart action-decoder design and careful data hygiene, makes the agent more causal and more human-like in real-time play. How it works: (1) Build a fast transformer that runs at 20 Hz; (2) Tokenize images efficiently and condition on short text goals; (3) Predict a compact action token, then expand it with a tiny action decoder; (4) Close the training–inference gap (augmentations, color space, identical resizing); (5) Scale data and depth to push down loss and up causal dependence on visuals. Why it matters: Without scaling and the architecture/recipe, the agent copies recent actions, jitters, and misses goals.

🍞 Bottom Bread (Anchor): A 1.2B-parameter model trained on the full dataset performs better and behaves more causally than smaller ones, and humans prefer its videos.

Multiple analogies:

  • Sponge analogy: A bigger sponge (model) with cleaner water (curated, augmented data) soaks up the parts that matter (causal signals) and leaves the dirt (spurious shortcuts).
  • Orchestra analogy: The main transformer is the conductor; the action decoder is a nimble soloist that turns the conductor’s cue into detailed notes, keeping the whole performance on tempo (real time).
  • Detective analogy: With more cases (data) and a sharper mind (depth), the detective learns to trust fingerprints and alibis (visual causes), not rumors (past-action shortcuts).

🍞 Top Bread (Hook): You know how practice logs show steady improvement, often following a curve?

🥬 Filling: What it is: Scaling laws describe how test loss falls in a smooth pattern as you add data or parameters. How it works: (1) Train models at several sizes and data fractions; (2) Fit a simple curve relating loss to data; (3) See consistent gains for larger, deeper models with more data. Why it matters: It tells you whether buying more data or deepening the model will still pay off.

🍞 Bottom Bread: The 1.2B model’s loss versus data follows a power-law curve, so more frames predictably help.

🍞 Top Bread (Hook): Think of giving one big instruction—“Make a sandwich”—and having a helper do the bread, the fillings, and the cut.

🥬 Filling: What it is: The action decoder lets the big model predict just one action token; a tiny transformer turns it into full keyboard and mouse outputs. How it works: (1) Main model emits one compact token; (2) Small decoder autoregressively expands to 8 tokens (4 for keys, 2 for mouse movement, 2 for mouse buttons); (3) This reduces compute per frame. Why it matters: Without it, predicting every sub-action inside the big model would be too slow for 20 Hz.

🍞 Bottom Bread: This design yields around a 5× speedup at inference.

Before vs After:

  • Before: Single-game agents or slow, large VLMs; fragile behavior cloning that overuses prior actions and struggles live.
  • After: A single open model that runs in real time across PC games, uses text hints, and improves causal behavior by scaling the right parts.

Why it works (intuition):

  • Depth and diverse data expose shortcut features as unreliable; the model learns stable, cause-tied cues in frames.
  • The attention mask and token layout avoid peeking at current ground-truth actions, stopping information leakage.
  • Augmentations and exact resizing/color pipelines make train frames “feel” like live frames, so skills transfer.

Building blocks:

  • Efficient image tokens (few per frame) to look longer back in time.
  • Text conditioning for goal shaping.
  • Custom attention mask and a “reasoning” token for extra thinking per frame.
  • Autoregressive action decoder for fast, fine-grained control.
  • Data augmentation and correction data to survive real deployment.
  • Simple, measurable scaling that predicts steady gains.

03 Methodology

High-level overview: Input (video frames + optional text + past actions) → Policy Transformer (with thinking and action-prediction tokens) → Tiny Action Decoder (expands to full keyboard/mouse) → Real-time actions at 20 Hz.
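
To make the pipeline tangible, here is a minimal sketch of that 20 Hz control loop, with every component stubbed out: frame capture, the policy call, and the OS-level action emitter are placeholders, not the paper's implementation. Only the timing logic (a fixed 50 ms budget per frame) is the real point.

```python
# Sketch of a fixed-rate control loop at 20 Hz (all components are stubs/assumptions).
import time

FRAME_BUDGET_S = 1.0 / 20.0   # 20 Hz → 50 ms per frame

def capture_frame():          # placeholder for a screen grab
    return None

def policy_step(frame, text=None):   # placeholder: one forward pass → expanded action
    return {"keys": ["w"], "mouse_dx": 3, "mouse_dy": -1, "buttons": []}

def emit_action(action):      # placeholder for sending keyboard/mouse events to the OS
    pass

def run(num_frames=100, text=None):
    for _ in range(num_frames):
        start = time.perf_counter()
        action = policy_step(capture_frame(), text)
        emit_action(action)
        # Sleep whatever is left of the 50 ms budget so actions land at a steady 20 Hz.
        time.sleep(max(0.0, FRAME_BUDGET_S - (time.perf_counter() - start)))

run(num_frames=5)
```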

🍞 Top Bread (Hook): Think of making a cookbook: great recipes, clear photos, and notes from expert chefs.

🥬 Filling (Dataset – Annotated Gameplay): What it is: 8,300+ hours of expert 3D gameplay paired with keyboard/mouse actions at 20 FPS. How it works: (1) Record only active play; (2) Keep a diverse mix of games, hardware, resolutions; (3) Filter bad clips; (4) Add a little correction data where humans rescue the agent. Why it matters: Without high-quality, varied demonstrations, the model learns brittle shortcuts.

🍞 Bottom Bread (Anchor): From Roblox to DOOM and Quake, the model sees many ways humans aim, move, and recover.

🍞 Top Bread (Hook): A coach’s sticky note, “Head to the red door,” can change how you play the next minute.

🥬 Filling (Text Conditioning): What it is: Short, goal-like instructions attached to time windows. How it works: (1) A VLM watches downsampled clips offline and writes macro instructions with timestamps; (2) A text encoder turns the sentence into an embedding; (3) The policy conditions on it when present. Why it matters: Without goals, the agent may miss rare but crucial actions (like pressing a button) in ambiguous scenes.

🍞 Bottom Bread: “Press the red button” in a Quake maze clearly boosts success.
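
A minimal sketch of what text conditioning can look like, assuming a toy bag-of-words text encoder and a learned "no-text" embedding; the paper's actual text encoder and policy interface are not reproduced here. The instruction embedding is simply concatenated with the visual features before the action head.

```python
# Toy text-conditioning sketch (hypothetical encoder and dimensions).
import torch
import torch.nn as nn

VOCAB, TEXT_DIM, VIS_DIM, NUM_ACTIONS = 1000, 64, 256, 64

class TextConditionedHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.EmbeddingBag(VOCAB, TEXT_DIM)   # mean-pools token embeddings
        self.no_text = nn.Parameter(torch.zeros(TEXT_DIM))     # default "no-text" embedding
        self.head = nn.Linear(VIS_DIM + TEXT_DIM, NUM_ACTIONS)

    def forward(self, vis_feat, token_ids=None):
        if token_ids is None:                                   # no instruction this frame
            txt = self.no_text.expand(vis_feat.size(0), -1)
        else:
            txt = self.text_encoder(token_ids)
        return self.head(torch.cat([vis_feat, txt], dim=-1))

model = TextConditionedHead()
vis = torch.randn(2, VIS_DIM)
tokens = torch.randint(0, VOCAB, (2, 5))                        # e.g. "press the red button"
print(model(vis, tokens).shape, model(vis).shape)               # both (2, NUM_ACTIONS)
```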

🍞 Top Bread (Hook): Packing a suitcase smartly lets you bring more outfits in less space.

🥬 Filling (Image Tokenization): What it is: A lightweight image encoder (EfficientNet layers) that turns each 192×192 frame into very few tokens. How it works: (1) Extract early features; (2) Project into 1–4 visual tokens; (3) Train the encoder with the policy (unfrozen). Why it matters: Without few tokens, looking far back in time gets too slow; freezing the encoder hurts representation quality.

🍞 Bottom Bread: With just 1–4 tokens per frame, the model can attend across long histories for smoother camera and aiming.
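
Here is a sketch of few-token image encoding, assuming torchvision's EfficientNet-B0 as the backbone and a 2×2 pooled grid; the exact layer cut, token count, and projection size in the paper may differ.

```python
# Sketch of "few visual tokens per frame": early EfficientNet features pooled to a
# 2×2 grid give 4 tokens per 192×192 frame (layer choice and dims are assumptions).
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

class FrameTokenizer(nn.Module):
    def __init__(self, d_model=512, grid=2):
        super().__init__()
        # weights=None avoids downloading pretrained weights; only early layers are kept.
        self.backbone = efficientnet_b0(weights=None).features[:4]
        self.pool = nn.AdaptiveAvgPool2d(grid)                  # grid×grid spatial cells
        self.proj = nn.LazyLinear(d_model)                      # channel dim inferred lazily

    def forward(self, frames):                        # frames: (B, 3, 192, 192)
        feat = self.pool(self.backbone(frames))       # (B, C, grid, grid)
        tokens = feat.flatten(2).transpose(1, 2)      # (B, grid*grid, C)
        return self.proj(tokens)                      # (B, 4, d_model) with grid=2

tok = FrameTokenizer()
print(tok(torch.randn(2, 3, 192, 192)).shape)         # torch.Size([2, 4, 512])
```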

🍞 Top Bread (Hook): Instead of shouting every step of a dance, a choreographer can give a single cue and a dancer fills in the details.

🥬 Filling (Action Decoder): What it is: A tiny transformer that expands one compact action token into full keyboard/mouse outputs. How it works: (1) The big policy emits one action-prediction token; (2) The small decoder autoregressively outputs 8 tokens (4 for keys, 2 for mouse movement, 2 for mouse buttons); (3) Only one big forward pass per frame. Why it matters: Without it, the big model would be too slow to predict all sub-actions at 20 Hz.

🍞 Bottom Bread: This yields about a 5× real-time speedup.
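
Here is a minimal sketch of such a decoder: the big policy hands over a single embedding, and a small two-layer transformer decoder expands it autoregressively into 8 sub-action tokens. The vocabulary sizes, dimensions, and greedy decoding are assumptions for illustration, not the paper's configuration.

```python
# Sketch of a tiny autoregressive action decoder conditioned on one policy embedding.
import torch
import torch.nn as nn

D, SUB_VOCAB, NUM_SUB = 256, 128, 8   # hypothetical dims: 8 sub-actions, 128 bins each

class TinyActionDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(SUB_VOCAB + 1, D)   # +1 for a BOS token
        self.pos_emb = nn.Embedding(NUM_SUB, D)
        layer = nn.TransformerDecoderLayer(D, nhead=4, dim_feedforward=4 * D, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(D, SUB_VOCAB)

    @torch.no_grad()
    def generate(self, action_embedding):               # (B, 1, D) from the big policy
        B = action_embedding.size(0)
        tokens = torch.full((B, 1), SUB_VOCAB)           # start from BOS
        for _ in range(NUM_SUB):
            L = tokens.size(1)
            x = self.tok_emb(tokens) + self.pos_emb(torch.arange(L))
            mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
            h = self.decoder(x, memory=action_embedding, tgt_mask=mask)
            next_tok = self.head(h[:, -1]).argmax(-1, keepdim=True)   # greedy for simplicity
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens[:, 1:]                             # (B, 8) sub-action ids

dec = TinyActionDecoder()
print(dec.generate(torch.randn(2, 1, D)).shape)          # torch.Size([2, 8])
```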

🍞 Top Bread (Hook): Don’t let a student peek at the answer while solving a math problem.

🥬 Filling (Attention Mask to Ensure Causality): What it is: A custom mask that stops the action-prediction token from looking at the current ground-truth action. How it works: (1) During training, past true actions are inputs; (2) The mask forbids seeing the same-step action; (3) Other tokens can attend appropriately. Why it matters: Without this, the model could cheat and copy, worsening causal confusion.

🍞 Bottom Bread: The policy must base its choice on frames (and text), not on the hidden label it’s supposed to predict.
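
A toy sketch of the idea, assuming a per-frame token layout of [visual, visual, ground-truth action, prediction]; the paper's real layout differs, but the rule is the same: start from a standard causal mask, then additionally block the prediction token from attending to its own frame's ground-truth action.

```python
# Sketch of a causality-preserving attention mask (toy token layout is an assumption).
import torch

TOKENS_PER_FRAME = 4          # toy layout per frame: [V, V, A, P]
ACT_IDX, PRED_IDX = 2, 3      # A = ground-truth action (input), P = action-prediction token

def build_mask(num_frames):
    L = num_frames * TOKENS_PER_FRAME
    allowed = torch.tril(torch.ones(L, L, dtype=torch.bool))   # causal: no future tokens
    for f in range(num_frames):
        p = f * TOKENS_PER_FRAME + PRED_IDX
        a = f * TOKENS_PER_FRAME + ACT_IDX
        allowed[p, a] = False   # prediction token must not see this frame's true action
    return allowed              # True = may attend; pass ~allowed as an additive -inf mask

print(build_mask(2).int())      # later frames may still attend to earlier true actions
```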

🍞 Top Bread (Hook): Practicing on pianos that feel like your concert piano prevents a surprise on stage.

🥬 Filling (Fixing the Training–Inference Gap): What it is: Making training frames match live frames as closely as possible. How it works: (1) Use RGB over YUV when possible; (2) Make the resize function bitwise-identical across train and inference; (3) Add realistic augmentations (color jitter, blur, noise, tiny rotations, Planckian jitter). Why it matters: Without this, the agent looks great offline but stumbles online with shaky camera or idle behavior.

🍞 Bottom Bread: After these fixes, online play and human preference improve.
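
A sketch of a train-time augmentation stack in this spirit, using torchvision; the exact parameters are assumptions, and Planckian jitter (available in kornia as RandomPlanckianJitter) is omitted to keep dependencies small. The companion rule is just as important: the resize code must be the exact same function at training and inference.

```python
# Sketch of realistic frame augmentations (parameter values are assumptions).
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 1.5))], p=0.3),
    transforms.RandomRotation(degrees=2),                                   # tiny rotations only
    transforms.Lambda(lambda x: (x + 0.01 * torch.randn_like(x)).clamp(0.0, 1.0)),  # sensor noise
])

frame = torch.rand(3, 192, 192)        # an RGB training frame in [0, 1]
print(augment(frame).shape)            # torch.Size([3, 192, 192])
```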

🍞 Top Bread (Hook): A mouse with gentle curves near the center and capped extremes feels smooth to use.

🥬 Filling (Mouse Discretization + Sampling): What it is: Quantile-based bins with truncated-normal sampling inside each bin. How it works: (1) Discretize x/y mouse moves with fine resolution near zero; (2) Fit a truncated normal per axis; (3) At inference, sample within the predicted bin. Why it matters: Without this, aiming can feel too twitchy or too sluggish.

🍞 Bottom Bread: The result is smoother, steadier crosshair motion.
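
Below is a sketch of quantile binning plus in-bin truncated-normal sampling using numpy and scipy; the bin count is an assumption, and instead of fitting the truncated normal per axis as described, this sketch simply centers it inside the predicted bin.

```python
# Sketch of quantile bins + truncated-normal sampling for mouse deltas (toy values).
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
# Fake recorded mouse-x deltas: mostly tiny moves, occasional large flicks.
deltas = np.concatenate([rng.normal(0, 2, 90_000), rng.normal(0, 60, 10_000)])

NUM_BINS = 32
edges = np.quantile(deltas, np.linspace(0, 1, NUM_BINS + 1))   # edges are dense near zero

def sample_from_bin(bin_idx):
    low, high = edges[bin_idx], edges[bin_idx + 1]
    loc = 0.5 * (low + high)                                    # center the normal in the bin
    scale = max((high - low) / 4.0, 1e-6)
    a, b = (low - loc) / scale, (high - loc) / scale            # truncation in sigma units
    return truncnorm.rvs(a, b, loc=loc, scale=scale, random_state=rng)

predicted_bin = np.digitize(1.3, edges) - 1                     # which bin a small move falls in
print(edges[predicted_bin], edges[predicted_bin + 1], sample_from_bin(predicted_bin))
```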

🍞 Top Bread (Hook): If you have lots of silent movies, you can guess the missing dialogue by learning lip-reading.

🥬 Filling (Unlabeled Data via Inverse Dynamics Model): What it is: A helper model that infers actions from videos without labels, to pretrain the policy. How it works: (1) Train an IDM on labeled data; (2) Use it to pseudo-label large unlabeled gameplay; (3) Pretrain the policy, then fine-tune on real labels. Why it matters: Without leveraging unlabeled video, you leave massive learning signal unused.

🍞 Bottom Bread: This lowers test loss, although human preference gains are not yet clear.
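
A compact sketch of the pseudo-labeling pipeline with stand-in models and pre-encoded frame features (all toy assumptions): train the IDM on labeled frame pairs, label unlabeled clips with it, then pretrain the policy on those guesses before fine-tuning on real labels.

```python
# Sketch of inverse-dynamics pseudo-labeling (toy models and feature vectors).
import torch
import torch.nn as nn

NUM_ACTIONS, FEAT = 64, 128

idm = nn.Sequential(nn.Linear(2 * FEAT, 256), nn.ReLU(), nn.Linear(256, NUM_ACTIONS))
policy = nn.Sequential(nn.Linear(FEAT, 256), nn.ReLU(), nn.Linear(256, NUM_ACTIONS))
loss_fn = nn.CrossEntropyLoss()

# 1) Train the IDM on labeled consecutive-frame pairs (one toy step shown).
f_t, f_next = torch.randn(16, FEAT), torch.randn(16, FEAT)      # stand-ins for encoded frames
true_actions = torch.randint(0, NUM_ACTIONS, (16,))
idm_loss = loss_fn(idm(torch.cat([f_t, f_next], dim=-1)), true_actions)

# 2) Pseudo-label unlabeled gameplay with the (trained) IDM.
u_t, u_next = torch.randn(16, FEAT), torch.randn(16, FEAT)
with torch.no_grad():
    pseudo_actions = idm(torch.cat([u_t, u_next], dim=-1)).argmax(-1)

# 3) Pretrain the policy on the pseudo-labeled frames, then fine-tune on real labels.
pretrain_loss = loss_fn(policy(u_t), pseudo_actions)
print(f"idm loss {idm_loss.item():.3f}, policy pretrain loss {pretrain_loss.item():.3f}")
```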

Finally, deployment: The model uses key–value caching and sliding-window attention for speed, and runs at 20 Hz on a consumer GPU. Camera sensitivities are tuned per game to keep micro-aiming stable. Text is used only when instruction-following is tested; otherwise, a default “no-text” token is fed.

Secret sauce:

  • The action decoder for speed and stability.
  • The custom attention mask and “reasoning” token for clean, causal decisions per frame.
  • Ruthless focus on closing the training–inference gap so offline skill shows up online.

04 Experiments & Results

🍞 Top Bread (Hook): When you test a new bike, you try a flat loop, a small obstacle course, then real traffic.

🥬 Filling (The Test): What it is: Three kinds of evaluations—(1) simple Godot games for controlled scores; (2) blinded human ratings on real titles; (3) scaling and causality analyses. How it works: (1) Hovercraft: time to complete a loop; (2) Simple-FPS: hits on enemy minus hits taken; (3) Human rubric counts wall bumps, idle time, jitter, etc.; (4) Causality measured by how much predictions change if you swap some frames. Why it matters: Numbers plus human eyes reveal both skill and “feel.”

🍞 Bottom Bread (Anchor): Larger models get better scores in Godot games and are preferred more often by humans watching DOOM/Quake/Roblox clips.

Simple programmatic environments:

  • Across 16 runs per model, the 1.2B model posts the best mean scores and lowest variance. Real-time throughput stays high (tens of FPS) on an RTX 5090.

Human evaluation on real games:

  • Evaluators, blinded to which model played, counted issues (wall collisions, air shots, misses, idle, jitter, non-human moves). Bigger models consistently had fewer issues, and preference charts favored them.

🍞 Top Bread (Hook): A sticky note, “press the red button,” can rescue a maze run.

🥬 Filling (Instruction Following): What it is: Using short text hints during play. How it works: In a Quake maze, models run from the same checkpoint with and without the instruction “press the red button.” Why it matters: Without text, the policy often overlooks this rare but required action.

🍞 Bottom Bread: With the instruction, success rates jump for all model sizes.

🍞 Top Bread (Hook): Think of a graph where practice time predicts exam scores in a smooth curve.

🥬 Filling (Scaling Laws Result): What it is: Test loss follows a predictable curve as data grows. How it works: Train 150M–1.2B models on 6%–100% of ~500M frames; pick the best checkpoint; fit a simple curve. Why it matters: Predictability lets you plan whether more data or deeper models are worth it.

🍞 Bottom Bread: The 1.2B model’s best loss versus data fits a clean power-law, with larger models benefiting more when data is plenty.
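
Here is a sketch of how such a power-law fit works, with made-up loss/data points purely for illustration (they are not the paper's numbers, and the paper's actual fit may also include an irreducible-loss offset).

```python
# Sketch of fitting loss ≈ a * D^(-b) in log-log space (all values hypothetical).
import numpy as np

frames = np.array([30e6, 60e6, 125e6, 250e6, 500e6])       # hypothetical data sizes D
loss   = np.array([2.10, 1.92, 1.78, 1.66, 1.55])           # hypothetical best test losses

slope, log_a = np.polyfit(np.log(frames), np.log(loss), 1)  # linear fit in log-log space
a, b = np.exp(log_a), -slope
print(f"loss ≈ {a:.2f} * D^(-{b:.3f})")
print("predicted loss at 1B frames:", a * 1e9 ** (-b))      # extrapolate the curve
```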

🍞 Top Bread (Hook): If a student starts answering just from the last answer they wrote, you know they’re not reading the new question.

🥬 Filling (Causality Tests): What it is: Two tests—(1) a toy two-feature world; (2) a large-scale “causality score” in real data. How it works: Toy: compare obstacle vs prior brake-light to see if the network learns the true cause; deeper MLPs learn causal behavior faster than linear models under SGD. Large-scale: compute KL divergence between outputs on original vs partially frame-swapped inputs (actions unchanged); higher KL means heavier reliance on visuals. Why it matters: Without causal dependence on frames, the agent overuses action priors and fails in new scenes.

🍞 Bottom Bread: Bigger, deeper models trained longer and on more unique frames show higher causality scores, except in extremely tiny-data cases.
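
Here is a sketch of the frame-swap probe with a stand-in policy: compute the action distribution on the original frames and on frames partially swapped in from other clips while the past-action inputs stay fixed, then average the KL divergence between the two. The real score is of course computed with the trained model on held-out clips.

```python
# Sketch of a frame-swap causality score (stand-in policy and feature shapes).
import torch
import torch.nn as nn

FEAT, ACT_DIM, NUM_ACTIONS = 64, 16, 64

policy = nn.Sequential(nn.Linear(FEAT + ACT_DIM, 128), nn.ReLU(), nn.Linear(128, NUM_ACTIONS))

def action_dist(frame_feat, past_actions):
    return torch.softmax(policy(torch.cat([frame_feat, past_actions], dim=-1)), dim=-1)

def causality_score(frame_feat, swapped_feat, past_actions):
    p = action_dist(frame_feat, past_actions)          # original frames
    q = action_dist(swapped_feat, past_actions)        # swapped frames, same past actions
    # Mean KL(p || q) over the batch; higher means heavier reliance on visual input.
    return (p * (p.clamp_min(1e-8).log() - q.clamp_min(1e-8).log())).sum(-1).mean()

frames = torch.randn(32, FEAT)
swapped = torch.randn(32, FEAT)                        # stand-in for frames from other clips
past = torch.randn(32, ACT_DIM)
print(causality_score(frames, swapped, past).item())
```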

Unlabeled pretraining:

  • A 600M model pretrained on pseudo-labeled videos achieves noticeably lower test loss than the same-size model trained only on labeled data. However, in human preference tests, it doesn’t clearly win yet—likely due to diverse but mismatched motion styles in unlabeled sources.

Surprises:

  • Even though a perfect linear policy exists in the toy problem, SGD with random init doesn’t find it; shallow linear models stall while nonlinear ones learn the causal rule.
  • Causality scores can keep rising even when test loss starts to overfit, so both metrics should be viewed together.

05 Discussion & Limitations

Limitations:

  • Instruction following is simple and template-like; richer, longer, nested goals were not covered by the limited text annotations.
  • The unlabeled pretraining improves test loss but doesn’t yet translate into clearly better human-rated play, possibly because pseudo-labels import odd movements or game-specific quirks.
  • The model shines in reactive, short-horizon tasks; complex long-term planning or puzzle-solving remains limited.
  • Real-time constraints cap model size and context length; extremely deep lookback or heavier vision stacks aren’t feasible at 20 Hz on a single consumer GPU.
  • Multi-game generality is good but not universal; unusual UIs, HUDs, or very high mouse sensitivity can still trip the model.

Required resources:

  • A high-end consumer GPU (e.g., RTX 5090) for inference at 20 Hz, and multi-GPU training (8× H100 used in the paper).
  • Storage and bandwidth for 8,300+ hours of videos and augmentations.
  • Optional access to a commercial VLM for bulk text annotation and for filtering unlabeled videos.

When NOT to use:

  • Tasks needing strategic planning over many minutes or with hidden objectives that require complex memory and reasoning.
  • Environments with actions outside standard keyboard/mouse or with very rare, high-stakes maneuvers lacking training examples.
  • Settings where safety requires guarantees beyond imitation (e.g., no-fail constraints), since BC offers no formal safety proofs.

Open questions:

  • How to scale instruction diversity so the agent follows complex, multi-step, or abstract goals robustly?
  • Can better pseudo-labelers (e.g., latent-action IDMs or hybrid world models) make unlabeled pretraining improve human preference, not just test loss?
  • What token layouts or temporal memories best balance real-time speed and long-horizon reasoning?
  • Can we devise direct, online measures of causal dependence during training, beyond KL on frame swaps?
  • How far do the scaling laws hold under more diverse, messy, or adversarial gameplay streams?

06 Conclusion & Future Work

Three-sentence summary: This paper releases an open, real-time, multi-game behavior cloning recipe—data, code, and models—that plays PC games from pixels and optional text. The key finding is that scaling model capacity and dataset size, together with a fast action decoder and careful data hygiene, makes policies more causal and more human-like. Experiments from toy worlds to billion-parameter models show cleaner scaling curves and higher causality scores, with humans preferring larger models.

Main achievement: Proving in practice that “just scale BC (the right way)”—with compact vision tokens, a custom attention mask, an action decoder, and meticulous train–inference matching—produces real, measurable gains in causal reasoning and online gameplay across many 3D titles.

Future directions:

  • Expand instruction datasets (variety and length) so agents follow complex goals and adapt on the fly.
  • Improve unlabeled pretraining with stronger IDMs or latent-action pipelines that transfer to human preference.
  • Explore longer temporal memory while preserving 20 Hz speed, perhaps via hierarchical tokens or event-driven attention.
  • Add lightweight planning or value-estimation heads to blend imitation with simple foresight.

Why remember this: It’s an end-to-end, open, reproducible path showing that behavior cloning—often doubted for causality—can, when scaled and engineered carefully, deliver fast, general, and more causally grounded game agents that run on everyday hardware.

Practical Applications

  • Aim assistant that reduces jitter and overshoot for players with motor impairments.
  • Practice partner bots for FPS warm-ups that move and aim like humans in custom maps.
  • In-game tutorial agents that follow short text goals to demonstrate puzzle steps on demand.
  • Automated QA bots that traverse levels reliably to test doors, buttons, and checkpoints.
  • Esports VOD analyzers that generate macro instructions and actionable drills from gameplay.
  • Data collection tools that auto-flag poor-quality segments and suggest targeted replays.
  • Modding frameworks that drop in a real-time agent for sandbox challenges and user maps.
  • Research baselines for studying causal learning and scaling in embodied settings.
  • Game streaming helpers that keep camera steady during busy scenes for viewer comfort.
  • Robotics sim pretraining, transferring causal visual attention patterns to real-world tasks.
#behavior cloning #causal reasoning #causal confusion #scaling laws #transformer policy #action decoder #text-conditioned policy #real-time inference #data augmentation #training–inference gap #inverse dynamics model #multimodal control #video game AI #keyboard-mouse actions #power-law scaling