
GameDevBench: Evaluating Agentic Capabilities Through Game Development

Intermediate
Wayne Chi, Yixiong Fang, Arnav Yayavaram et al. Ā· 2/11/2026
arXiv

Key Summary

  • GameDevBench is a new benchmark that checks whether AI agents can actually build parts of video games, not just write code in one file.
  • It uses the Godot game engine and 132 tasks taken from real tutorials, so agents must handle code plus images, shaders, sounds, and animations.
  • Tasks are hard: on average, solutions change about 5 files and 106 lines across 3–4 different file types, over three times more edits than popular software benchmarks.
  • Agents do much better on gameplay logic than on graphics and animation; average success drops from about 46.9% to 31.6% when deep visual understanding is needed.
  • Two simple feedback tools, editor screenshots (via MCP) and short gameplay videos, consistently boost scores by giving agents visual clues.
  • The top score is 54.5%, reached by Gemini 3 Pro with screenshots and video enabled; Claude Sonnet 4.5 jumps from 33.3% to 47.7% with video help.
  • Results depend on both the model and the agent framework (e.g., native tools vs. OpenHands) and show big differences in cost-effectiveness.
  • Tests are deterministic inside Godot, so correctness is checked by game behavior (like collisions and camera view), not by a judge model.
  • The benchmark is public and renewable, built from tutorials with automated construction, auto-refinement, and quick human validation.
  • Findings highlight a major gap: current agents need stronger multimodal skills and better game-specific patterns to succeed.

Why This Research Matters

Real apps mix many media types—text, images, animation, and logic—and GameDevBench mirrors that reality. By grading agents inside a real engine with automatic tests, we learn whether they can truly build working features, not just write pretty code. The simple visual feedback tricks (screenshots and short videos) show a practical, low-cost way to boost agent reliability across multimodal tasks. Lessons here can transfer to designing interfaces, educational tools, simulation dashboards, and even robotics displays where vision and code meet. As models improve on this benchmark, they inch closer to being real teammates in creative, technical projects. This could speed up small studios, help classrooms teach game design, and reduce the gap between idea and playable prototype.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you’re building a LEGO city. You don’t just use one kind of brick—you need windows, doors, wheels, and people. Video games are like that, but with code, pictures, sounds, and moving animations all working together.

🄬 The Concept (Game engine):

  • What it is: A game engine is a big toolbox that helps you build and run games.
  • How it works: 1) You drop in scenes and objects (like characters and levels), 2) connect scripts that tell things how to move and react, 3) add art and sounds, 4) press play to run the game.
  • Why it matters: Without a game engine, making games is slow and messy, like building a LEGO city with no baseplate or instructions. šŸž Anchor: Godot is one such engine; you can add a character sprite, attach a movement script, and instantly test jumping.

šŸž Hook: You know how your school binder has different tabs—math, science, art—each holding different things? Games also store different things in different file types.

🄬 The Concept (File types in game development):

  • What it is: Game projects use code files, scene files, images, sounds, shaders, and more.
  • How it works: 1) Code (.gd) tells objects what to do, 2) Scenes (.tscn) organize objects, 3) Images (.png) draw what you see, 4) Audio (.wav) adds sound, 5) Shaders (.gdshader/.tres) control fancy visuals.
  • Why it matters: If an agent only edits code and ignores images or shaders, the game may run but look wrong. šŸž Anchor: A walking animation needs picking the right sprite frames (images) and setting the speed in a scene.
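The difficulty numbers reported later (about 5 files and 106 lines across 3–4 file types per solution) come from summarizing reference diffs like this. A minimal Python sketch of that kind of summary, using a made-up extension-to-type mapping rather than the benchmark's exact grouping:

```python
from pathlib import Path

# Illustrative extension-to-type mapping; the benchmark's actual
# grouping of file types may differ.
EXT_TO_TYPE = {
    ".gd": "script", ".tscn": "scene", ".png": "image",
    ".wav": "audio", ".gdshader": "shader", ".tres": "resource",
}

def difficulty(edits):
    """Summarize a solution as (files touched, lines changed, distinct file types)."""
    files = len(edits)
    lines = sum(n for _, n in edits)
    types = {EXT_TO_TYPE.get(Path(p).suffix, "other") for p, _ in edits}
    return files, lines, len(types)

# Hypothetical (path, lines_changed) records for one reference solution.
edits = [("player.gd", 40), ("world.tscn", 30), ("run.png", 0),
         ("jump.wav", 0), ("glow.gdshader", 36)]
print(difficulty(edits))  # (5, 106, 5)
```

A solution spanning that many file types is exactly what code-only benchmarks never exercise.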

šŸž Hook: Watching a movie means using your eyes and ears together. Games need that too.

🄬 The Concept (Multimodal understanding):

  • What it is: The ability to understand and connect many kinds of data—text, code, images, animations, and timing.
  • How it works: 1) Read code to know logic, 2) look at images to pick the right frames, 3) use video or previews to confirm timing and camera view, 4) adjust everything to match the goal.
  • Why it matters: Without it, an AI might choose the wrong animation or place objects off-screen. šŸž Anchor: If you ask for a character’s ā€œrunā€ loop, the agent must find the run frames in a spritesheet—not the idle frames—and set the right speed.

šŸž Hook: Teachers give tests to see what you really understand. We need a test for AIs that try to build games.

🄬 The Concept (Benchmark):

  • What it is: A benchmark is a collection of tasks with clear scores to compare methods fairly.
  • How it works: 1) Give the same tasks to all agents, 2) run automatic checks, 3) record pass/fail.
  • Why it matters: Without a shared test, we can’t tell if one agent is truly better. šŸž Anchor: GameDevBench is such a test for making real, multimodal game changes inside Godot.

šŸž Hook: Imagine reading a recipe and also seeing a photo of the finished dish—it’s easier to cook correctly.

🄬 The Concept (Deterministic verification):

  • What it is: Automatic checks inside Godot that confirm the game behaves exactly as required every time.
  • How it works: 1) Unit tests load the scene, 2) they check collisions, animations, and camera visibility, 3) they declare pass or fail with no human judging.
  • Why it matters: Removes guesswork and bias; the same task always has the same result. šŸž Anchor: A test can confirm the correct animation name is playing frame-by-frame, not just a similar one.
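The real checks are GDScript unit tests running inside Godot; as a Python analogy (all scene fields below are invented for illustration), a deterministic verifier is just a chain of exact assertions with no judge model:

```python
def camera_sees(camera_rect, pos):
    """True if a point lies inside the camera's visible rectangle."""
    x0, y0, x1, y1 = camera_rect
    x, y = pos
    return x0 <= x <= x1 and y0 <= y <= y1

def verify(scene):
    """Deterministic pass/fail: the same scene state gives the same verdict every run."""
    return (scene["animation"] == "run"          # exact animation name, not a similar one
            and scene["on_floor"]                # collision resolved
            and camera_sees(scene["camera_rect"], scene["player_pos"]))

good = {"animation": "run", "on_floor": True,
        "camera_rect": (0, 0, 320, 180), "player_pos": (100, 150)}
bad = dict(good, animation="idle")  # looks close, but the wrong animation fails
print(verify(good), verify(bad))  # True False
```

Because every check is an exact comparison, two runs on the same project can never disagree.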

Before this paper, AI agents were improving at pure coding, but lagging at multimodal jobs like game development. Many tests only measured code changes in a single language and missed visuals, animation timing, and scene structure. People tried related tasks—like predicting next video frames or generating levels—but those didn’t measure the full, hands-on process of building a working game scene in an engine. The missing piece was a single, realistic place where agents must wrangle big codebases and visual assets, then be graded by exact game behavior.

GameDevBench fills that gap. It gathers 132 real tasks from video and web tutorials, covering 2D and 3D graphics, gameplay logic, and UI. It’s tough: on average, correct fixes touch about 5 files and 106 lines across 3–4 file types—over three times more edits than a popular software benchmark. Tests are inside Godot, so success is proven by physics and visuals that actually work. Why should you care? Because the same multimodal skills used to place colliders or pick animation frames can help AI handle other real-world jobs that mix code with visuals—like robotics, design tools, and user interfaces.

02 Core Idea

šŸž Hook: Imagine a coach who doesn’t just ask you math questions but also watches you build a model bridge and tests if it holds weight. That’s a better way to see what you can really do.

🄬 The Concept (GameDevBench):

  • What it is: A benchmark that tests if AI agents can build parts of real games—editing scenes, code, images, and animations—then proves success with automatic in-engine tests.
  • How it works: 1) Collect realistic tasks from tutorials, 2) give agents a full Godot project plus instructions, 3) let agents edit files, 4) run Godot tests to verify visuals and physics, 5) score pass@1 (did it solve on first try?).
  • Why it matters: It checks the hard, messy, multimodal parts that normal coding tests miss. šŸž Anchor: A task might ask for a run animation using specific frames; the test confirms the exact animation plays at the right speed.

The Aha! moment in one sentence: If we evaluate agents inside a real game engine with visual assets and strict, automatic tests, we can finally measure true multimodal building skills—not just code writing.

Three analogies:

  1. Cooking show: Instead of only reading your recipe, the judge tastes your dish (engine tests) and looks at the plating (visuals) to score you.
  2. Shop class: You’re graded on the birdhouse you build (scene + scripts + assets), not just on a blueprint you drew.
  3. Orchestra: You don’t just know notes; you must make all sections (code, sprites, shaders, sounds) play in sync.

Before vs. After:

  • Before: Agents aced many text-only code tasks but stumbled when images, animations, and scene structure mattered.
  • After: Agents are challenged by realistic game builds and graded by behavior; we learn where they fail (often visuals) and how to help (give editor screenshots or short videos).

Why it works (intuition):

  • Real constraints: The game must run, animations must be correct, colliders must touch—no hand-wavy judging.
  • Rich signals: Visual screenshots and gameplay videos give agents the same feedback humans use to fix mistakes.
  • Broad coverage: Tasks span UI, logic, 2D/3D visuals—so we see strengths and weaknesses across skills.

Building blocks: šŸž Hook: You know how a school fair has different booths—art, science, sports—so everyone’s skills get tested?

🄬 The Concept (Task categories):

  • What it is: Four skill groups—Gameplay Logic, 2D Graphics/Animation, 3D Graphics/Animation, and User Interface.
  • How it works: 1) Label each task by the main skill it needs, 2) also label which editor type it touches (scene, script, contextual like animation/shader), 3) compare model performance per group.
  • Why it matters: It reveals that agents do best on gameplay and worst on graphics-heavy tasks. šŸž Anchor: Agents average ~46.9% on gameplay but ~31.6% on 2D graphics tasks.
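The per-category comparison boils down to grouping pass/fail results by skill label. A sketch with invented toy results (the paper's real averages are about 46.9% on gameplay vs. 31.6% on 2D graphics):

```python
from collections import defaultdict

def pass_rate_by_category(results):
    """Average pass rate per task category from (category, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for cat, ok in results:
        totals[cat] += 1
        passes[cat] += int(ok)
    return {cat: passes[cat] / totals[cat] for cat in totals}

# Invented toy results, not the paper's data.
results = [("gameplay", True), ("gameplay", True), ("gameplay", False),
           ("2d_graphics", True), ("2d_graphics", False), ("2d_graphics", False)]
print(pass_rate_by_category(results))
```

Breaking scores down this way is what exposes the gameplay-vs-graphics gap instead of hiding it inside one average.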

šŸž Hook: When you’re stuck on a puzzle, a picture hint can help.

🄬 The Concept (Multimodal feedback tools):

  • What it is: Two aids—(1) MCP editor screenshots and (2) short gameplay videos.
  • How it works: 1) Screenshot tool returns a live view of the editor UI (scene tree, inspector), 2) video tool records the camera’s view over time, 3) agent uses these to check what’s wrong (like off-screen objects or wrong animation).
  • Why it matters: Agents fix more tasks when they can see what the game actually looks like. šŸž Anchor: Claude Sonnet 4.5 jumps from 33.3% to 47.7% when allowed to use video.
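How a screenshot tool helps can be sketched as a retry loop: edit, run the tests, and on failure ask for a visual hint before the next edit. Everything below (function names, the toy "off-screen sprite" scenario) is invented for illustration; the real tools are the MCP screenshot and Godot video capture described in the paper:

```python
def agent_loop(task, edit, run_tests, get_screenshot=None, max_rounds=3):
    """Edit-test loop; a failed round can pull in a visual hint for the next edit."""
    for _ in range(max_rounds):
        edit(task)
        if run_tests(task):
            return True
        if get_screenshot is not None:
            task["hint"] = get_screenshot(task)  # visual clue, e.g. an editor screenshot
    return False

# Toy scenario: a blind edit places the sprite off-screen; with a screenshot
# hint the agent corrects the position and the deterministic test passes.
def edit(task):
    task["x"] = 50 if task.get("hint") == "sprite off-screen" else 500

def run_tests(task):
    return 0 <= task["x"] <= 320  # must land inside the camera's view

def get_screenshot(task):
    return "sprite off-screen"

print(agent_loop({}, edit, run_tests))                  # False: never sees the problem
print(agent_loop({}, edit, run_tests, get_screenshot))  # True: fixed after one hint
```

The model is unchanged in both runs; only the visibility of the mistake differs, which mirrors the paper's finding.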

šŸž Hook: A stopwatch time is fairer than a judge’s opinion.

🄬 The Concept (Deterministic tests):

  • What it is: Exact, repeatable checks inside Godot that confirm behavior.
  • How it works: 1) Load scenes, 2) assert collisions, animation names, camera visibility, 3) pass or fail.
  • Why it matters: Removes bias and gives clear signals for improvement. šŸž Anchor: A test can assert the camera sees five green spheres within a range, not just ā€œthe scene looks nice.ā€

03 Methodology

At a high level: Tutorials and their repos → Task construction (with tests) → Agent attempts (code edits, optional visual feedback) → Godot-run verification → Scores and analysis.

Stage 1: Data Preparation

šŸž Hook: Think of gathering recipes before cooking dinner.

🄬 The Concept (Tutorial sourcing):

  • What it is: Collecting Godot 4 tutorials (video and web) that have open-source repos.
  • How it works: 1) YouTube transcripts are pulled, repos linked from descriptions, 2) web tutorials scraped (e.g., KidsCanCode), 3) only Godot 4 kept, 4) organize each tutorial into a folder.
  • Why it matters: Tasks come from real teaching material, ensuring realistic goals and assets. šŸž Anchor: They end up with 57 usable video tutorials and 31 top-picked web tutorials.

Stage 2: Automatic Task Construction

šŸž Hook: Imagine a helper turning a long recipe video into step-by-step cards.

🄬 The Concept (Task authoring agent):

  • What it is: An agent (using GPT-5 family) converts tutorials into testable tasks.
  • How it works: 1) Read transcript and repo, 2) split into smaller skills (e.g., animation vs. colliders), 3) write clear instructions, 4) build unit tests that match instructions, 5) create both a starting project and a ground-truth solution.
  • Why it matters: Produces many diverse tasks quickly, each tied to a real repo. šŸž Anchor: From 202 initial tasks, they refine down to 132 final tasks (115 bases + 17 variants).

Stage 3: Task Refinement

šŸž Hook: Editors proofread to catch mistakes before printing.

🄬 The Concept (Auto checks + fix-it prompts):

  • What it is: Prompts and checklists that spot common errors (ambiguous names, too-strict tests).
  • How it works: 1) Run scripted validations, 2) auto-fix typical issues (like off-camera scenes), 3) re-check alignment between instructions and tests.
  • Why it matters: Keeps tasks solvable and clear. šŸž Anchor: A preliminary study showed most issues were minor and fixable with this process.

Stage 4: Human Annotation

šŸž Hook: A referee confirms the field is fair before the game starts.

🄬 The Concept (Human validation):

  • What it is: 8 annotators (5 with game-dev experience) verify correctness, remove bad tasks, and create slight variations.
  • How it works: 1) Check instructions match tutorials, 2) ensure tests are neither vague nor too picky, 3) confirm projects run.
  • Why it matters: Final polish ensures quality and fairness. šŸž Anchor: They keep 115 base tasks and add 17 controlled variants.

Running Agents šŸž Hook: If you give a student a lab kit and a microscope, they can discover more. 🄬 The Concept (Agent frameworks):

  • What it is: Different local tools that let models read and edit project files and, if allowed, use visual feedback.
  • How it works: 1) Models run in frameworks like claude-code, gemini-cli, codex, or OpenHands, 2) they edit code/assets, 3) optional tools: screenshots via MCP or videos via Godot.
  • Why it matters: Frameworks change how well models can interact with projects—and their final performance. šŸž Anchor: Some models score higher in OpenHands; others do best in their native tools.

Scoring šŸž Hook: In sports, a scoreboard shows wins and losses clearly. 🄬 The Concept (pass@1 success rate):

  • What it is: Did the agent solve the task on its first attempt?
  • How it works: 1) Agent edits, 2) Godot tests run, 3) pass or fail is recorded.
  • Why it matters: Simple, fair comparison across models and tools. šŸž Anchor: Gemini 3 Pro reaches 54.5% with screenshots+video.
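pass@1 is just the fraction of tasks whose first attempt passes. A sketch (the 72-of-132 split below is chosen only because it reproduces the 54.5% headline figure, not taken from the paper's raw data):

```python
def pass_at_1(first_attempts):
    """Fraction of tasks solved on the very first attempt."""
    return sum(first_attempts) / len(first_attempts)

# 72 first-try passes out of 132 tasks is roughly the 54.5% top score.
results = [True] * 72 + [False] * 60
print(round(pass_at_1(results), 3))  # 0.545
```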

The Secret Sauce šŸž Hook: A flashlight and a map make a nighttime hike much easier. 🄬 The Concept (Visual feedback—MCP screenshots and runtime video):

  • What it is: Letting agents see the editor or the actual camera view over time.
  • How it works: 1) MCP tool returns a still image of editor panels (scene tree, inspector), 2) runtime video captures the exact game camera view, including animation timing, 3) agents use these to catch wrong animations, off-screen objects, or missing connections.
  • Why it matters: This simple visibility upgrade consistently improves results without changing the underlying models. šŸž Anchor: Claude Sonnet 4.5 improves by 14.4 percentage points (33.3% → 47.7%) with video.

Example Data Flow: Input (tutorial-derived task) → Agent edits scene/scripts/assets → Optional: request MCP screenshot or record short video → Run Godot unit tests → Output pass/fail + logs → Aggregate scores per skill and editor type.

04 Experiments & Results

The Test šŸž Hook: When you try a new board game, you read the rules and then see who can finish the missions. 🄬 The Concept (Evaluation protocol):

  • What it is: Measure how often agents complete tasks (pass@1), and how hard those tasks are (files/lines edited across file types).
  • How it works: 1) Each model runs through tasks, 2) optional visual feedback is toggled (none, screenshots, video, both), 3) scores are compared across skill categories and editor types.
  • Why it matters: This shows what agents can really build, not just what they can describe. šŸž Anchor: Average reference solutions touch ~5 files and ~106 lines across ~3.4 file types, far more than many coding-only tests.

The Competition šŸž Hook: A track meet works best when fast sprinters race side by side on the same track. 🄬 The Concept (Model and framework lineup):

  • What it is: Claude (Haiku, Sonnet, Opus), Gemini (Flash, Pro), GPT 5.1 Codex, Kimi K2.5, and Qwen3-VL, run in native frameworks or OpenHands.
  • How it works: 1) Each agent starts in the project folder, 2) they can run Godot, 3) sometimes they can also grab editor screenshots or make short videos.
  • Why it matters: Different models and frameworks change both skills and costs. šŸž Anchor: Qwen3-VL does great on a frontend benchmark but manages only ~8% here, showing game development is a different beast.

The Scoreboard (with context)

  • Baselines (no extra visuals, native frameworks): GPT Codex ā‰ˆ 34.1%, Claude Sonnet ā‰ˆ 33.3%, Claude Opus ā‰ˆ 39.4%, Gemini Flash ā‰ˆ 47.0%, Gemini Pro ā‰ˆ 46.2%.
  • With visual help, top results rise: Gemini Pro to 54.5% (screenshots+video), Claude Sonnet to 47.7% with video.
  • Skill gap: Gameplay tasks are easier (~46.9% average) than 2D graphics (~31.6%), revealing a multimodal weakness.
  • Editor gap: Stronger models handle scene and contextual editors about as well as scripts; weaker ones drop on scene/contextual tasks.
  • Cost-effectiveness: Gemini Flash is most cost-efficient; Claude models often cost more per solved task. Enabling screenshots/video usually raises cost but also boosts success.
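Cost-effectiveness here means dollars per solved task, so a cheap model with a decent pass rate can beat a pricier one. A sketch with invented dollar figures (the paper's exact costs are not reproduced):

```python
def cost_per_solved(total_cost_usd, num_tasks, pass_rate):
    """Dollars spent per successfully solved task."""
    return total_cost_usd / (num_tasks * pass_rate)

# Invented costs: a cheap high-scoring run vs. a pricier mid-scoring run.
cheap = cost_per_solved(50.0, 132, 0.470)    # e.g. a Flash-class model
pricey = cost_per_solved(200.0, 132, 0.394)  # e.g. a larger model
print(round(cheap, 2), round(pricey, 2))  # 0.81 3.85
```

With numbers like these, the smaller model solves a task for roughly a fifth of the price, which is why the paper tracks cost alongside accuracy.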

Surprising Findings

  • Visual feedback is simple yet powerful: Just seeing the editor or a short video helps agents correct many mistakes (like wrong sprite choice or off-screen placements).
  • Frameworks matter: Some models shine in OpenHands (e.g., GPT Codex), while Gemini Flash performs best in its native framework.
  • Bigger isn’t always pricier: Opus can cost half of Sonnet per solved task in one setup despite higher capacity.

Overall, even frontier models struggle with complex multimodal steps; success requires not only good code but also accurate visual reasoning.

05 Discussion & Limitations

Limitations

  • Engine focus: Tasks are in Godot 4. Results may differ in Unity or Unreal, though Godot shares many concepts with them.
  • Code-centric agents: Experiments mostly used code editing; GUI action agents might unlock higher scores by directly manipulating the editor.
  • Pass@1 only: First-try scoring is strict; iterative replanning or human-in-the-loop could raise success.
  • Visual parsing: Agents often mis-pick sprites or mis-handle animations—today’s multimodal models still struggle with fine visual details.
  • Scope and assets: While diverse, tasks can’t cover every edge case (e.g., huge open worlds, online multiplayer). Some assets are complex (thousands of frames) and stress context limits.

Required Resources

  • Godot 4 installed, with command-line access.
  • The benchmark repositories (tasks and ground truth) and scripts for validation.
  • Optional MCP server for screenshots and setup to record short videos.
  • An agent framework (native CLI tools or OpenHands) and enough compute to process projects with many assets.

When NOT to Use

  • Pure creativity judging (e.g., ā€œmake it look coolerā€) without exact, testable requirements.
  • Targets that depend on manual playtesting without a way to encode checks (e.g., long quests that need a human to watch).
  • Proprietary engines or closed assets where deterministic checks or tool access are limited.

Open Questions

  • Training: How do we teach models to reliably choose the correct sprites/frames and obey game patterns (signals, node trees)?
  • Tools: Can tighter editor integrations (dragging nodes, live inspectors) bring larger gains than screenshots/video alone?
  • Generalization: Will skills learned here transfer to other multimodal domains (robotics dashboards, design tools, AR/VR)?
  • Cost and efficiency: How do we reduce the token and tool cost of visual feedback while keeping its benefits?
  • Collaboration: What’s the best way to mix human hints and agent autonomy for faster, reliable game building?

06 Conclusion & Future Work

Three-Sentence Summary: GameDevBench is a new benchmark that tests whether AI agents can really build game features inside a real engine, handling code, images, animations, and more. It uses deterministic Godot tests plus simple visual feedback (screenshots and videos) to fairly measure multimodal skill. Results show strong models still struggle, especially with graphics, but visual tools boost performance meaningfully.

Main Achievement: Turning game development into a rigorous, multimodal, behavior-checked benchmark—proving what agents can build, not just what they can type.

Future Directions: Improve agents’ visual reasoning for sprites, shaders, and timing; build richer editor tools; explore training on game-specific patterns; and expand to other engines or interactive GUI agents. More renewable tasks from fresh tutorials can keep raising the bar and track progress over time.

Why Remember This: It marks a shift from code-only challenges to real-world, multimodal building—and shows that even simple visual feedback can unlock big improvements. As agents get better here, they’ll likely get better at many mixed media tasks we care about, from app design to robotics dashboards.

Practical Applications

  • Evaluate a new multimodal model by running it on GameDevBench tasks and tracking pass@1 across skill categories.
  • Add MCP screenshots or short runtime videos to your agent loop to quickly improve success on visual tasks.
  • Use the benchmark’s deterministic tests to regression-test your in-house coding agent as you update tools or prompts.
  • Create a training curriculum for agents: start with gameplay logic tasks, then advance to 2D/3D graphics and UI.
  • Diagnose failure modes by comparing results across editor types (scene vs. script vs. contextual).
  • Prototype GUI-action agents that operate the Godot editor directly, then measure gains versus code-only editing.
  • Benchmark cost-efficiency across frameworks (native vs. OpenHands) to pick the best setup for your budget.
  • Fine-tune models on common game patterns (signals, node trees) observed as frequent error sources.
  • Use task variants to test model sensitivity to small visual changes (e.g., choosing different animation frames).
  • Automate continuous evaluation by renewing tasks from fresh tutorials, catching performance drift over time.
Tags: GameDevBench Ā· Godot Ā· multimodal agents Ā· code generation Ā· deterministic evaluation Ā· Model Context Protocol Ā· runtime video feedback Ā· game engine benchmark Ā· 2D/3D graphics Ā· animation Ā· agent frameworks Ā· OpenHands Ā· SWE-Bench comparison Ā· pass@1 Ā· unit tests in engine