
KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning

Intermediate
Egor Cherepanov, Daniil Zelezetsky, Alexey K. Kovalev et al. Ā· 1/20/2026
arXiv Ā· PDF

Key Summary

  • KAGE-Bench is a fast, carefully controlled benchmark that tests how well reinforcement learning (RL) agents trained on pixels handle specific visual changes, like new backgrounds or lighting, without changing the actual game rules.
  • It is built on KAGE-Env, a JAX-native 2D platformer that separates what you see (the renderer) from how the world moves and rewards you (the hidden game), so only visuals change between training and testing.
  • The benchmark defines six visual axes—agent appearance, background, distractors, filters, effects, and layout—and creates 34 train–eval pairs that change exactly one axis at a time.
  • A standard PPO-CNN agent shows big, axis-dependent failures: backgrounds, photometric filters, and lighting often crush success rate, while changing the agent’s look is usually fine.
  • Some visual shifts let agents keep moving forward (good distance) but still fail the task (low success), proving that return alone can hide serious problems.
  • KAGE-Env runs fully on GPU/TPU with JAX (jit, vmap, scan) and reaches up to ~33 million environment steps per second, enabling fast, reproducible studies.
  • The paper formalizes that a visual change is equivalent to changing the state-conditional action distribution induced by the pixel policy (the ā€œinduced state policyā€), so failures can be blamed squarely on the visuals, not the hidden dynamics.
  • By reporting distance, progress, and success (not just return), KAGE-Bench makes it easier to see partial progress versus true completion.
  • This benchmark helps the community pinpoint which visual factors break agents and test fixes like better augmentations or architectures much faster.
  • Overall, KAGE-Bench turns messy visual robustness testing into clean, scalable, and scientifically precise diagnosis.

Why This Research Matters

Real robots, cars, and virtual assistants must work when the picture changes but the job stays the same—think day vs night, clean lab vs cluttered home, or camera noise and glare. KAGE-Bench lets us pinpoint exactly which visual changes break agents and how badly, instead of guessing. Because it runs extremely fast, researchers can test many ideas (augmentations, architectures, training curricula) quickly and fairly. Reporting distance, progress, and success ensures we notice when agents ā€œmoveā€ but still don’t ā€œfinish,ā€ which is crucial for safety and reliability. This kind of precise, scalable diagnosis pushes the field toward agents that are robust in the real world, not just in their training visuals.

Detailed Explanation


01 Background & Problem Definition

šŸž You know how a friend can still recognize you even if you wear a different jacket or stand under different lights? People are good at ignoring small visual changes and focusing on what matters. But many RL agents that look at pixels aren’t so forgiving—they can get very confused when the picture looks different, even if the game itself hasn’t changed.

🄬 The Concept: Reinforcement Learning (RL)

  • What it is: RL is a way for AI to learn by doing actions and getting rewards, like a player learning a game by trying moves and seeing scores.
  • How it works:
    1. See an observation (like a game frame).
    2. Pick an action.
    3. Get a reward and a new observation.
    4. Repeat, learning which actions lead to higher rewards over time.
  • Why it matters: Without RL, agents can’t adapt their behavior by experience; they’d just guess blindly. šŸž Anchor: Think of a kid learning to ride a bike—try, wobble, adjust, and keep going longer before falling. That’s RL.
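To make that loop concrete, here is a tiny self-contained Python sketch: a toy ā€œwalk to the goalā€ world with tabular Q-learning standing in for the learning step. The environment, learning rate, and exploration rate are all illustrative inventions, not anything from this paper or a specific library.

```python
import random

class LineEnv:
    """Toy world: start at position 0, reach position 3 to finish."""
    def reset(self):
        self.pos = 0
        return self.pos                                  # 1. the observation
    def step(self, action):                              # 0 = left, 1 = right
        self.pos = max(0, min(3, self.pos + (1 if action == 1 else -1)))
        done = self.pos == 3
        return self.pos, (1.0 if done else 0.0), done    # 3. reward + new observation

q = {(s, a): 0.0 for s in range(4) for a in (0, 1)}      # learned action values
env = LineEnv()
for _ in range(200):                                     # 4. repeat and improve
    obs, done = env.reset(), False
    while not done:
        if random.random() < 0.2:                        # explore occasionally
            action = random.randint(0, 1)
        else:                                            # 2. pick the best-looking action
            action = max((0, 1), key=lambda act: q[(obs, act)])
        nxt, reward, done = env.step(action)
        target = reward + 0.9 * max(q[(nxt, 0)], q[(nxt, 1)])
        q[(obs, action)] += 0.5 * (target - q[(obs, action)])  # learn from experience
        obs = nxt
print(q[(0, 1)] > q[(0, 0)])                             # typically True: it learned to go right
```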

šŸž Imagine you’ve memorized your school’s hallway by the posters on the wall, and suddenly the posters change; you might get lost even though the hall itself is the same.

🄬 The Concept: Visual Distribution Shift

  • What it is: A visual distribution shift is when the appearance of inputs (e.g., colors, textures, lighting) changes while the hidden rules of the world stay the same.
  • How it works:
    1. Train on one set of looks (say, black background).
    2. Test on another look (say, bright, noisy background).
    3. The agent’s vision is the only thing different; the game rules aren’t.
  • Why it matters: If agents depend on trivial visual cues (like a color patch), they’ll fail when looks change, even if the task is identical. šŸž Anchor: A self-driving toy car trained in sunshine might swerve at dusk unless it learned the real rules of the road, not just sunny pictures.

šŸž Think of a science test where the study guide mixes questions about math and history together—you’ll struggle to know which part you didn’t understand.

🄬 The Concept: Entangled Benchmarks

  • What it is: Many old benchmarks change multiple things at once (appearance, layout, dynamics), making it hard to blame failures on a single cause.
  • How it works:
    1. Use procedural generation or mixed changes.
    2. Train and test on different seeds.
    3. Performance gaps could come from many sources at once.
  • Why it matters: Without clean isolation, we can’t diagnose which visual factor truly breaks the agent. šŸž Anchor: If a math test also changes the language mid-exam, you can’t tell if errors were math or reading.

šŸž Picture playing a side-scrolling game: you only see the screen image, but under the hood the game has rules about gravity, jumps, and scoring.

🄬 The Concept: Visual POMDP

  • What it is: A visual POMDP is a decision problem where you only see pixels (partial information) generated by a hidden state and fixed rules.
  • How it works:
    1. Hidden state evolves by fixed dynamics.
    2. A renderer turns the hidden state into an image.
    3. The agent picks actions from the image.
  • Why it matters: It separates what changes (images) from what doesn’t (rules), so we can fairly test visual robustness. šŸž Anchor: A platformer where you always have the same gravity and level goal, but the background art can change.
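Here is a minimal Python sketch of that separation, with invented names and toy physics rather than KAGE-Env’s actual API: the hidden step and reward never change, only the renderer does, and a pixel policy that relies on a brittle visual shortcut behaves differently under the new look.

```python
import numpy as np

def hidden_step(x, action):
    """Fixed rules: move right if action == 1; reward forward progress."""
    next_x = x + (1.0 if action == 1 else 0.0)
    return next_x, next_x - x

def render(x, background):
    """Only this changes between train and eval: same state, new look."""
    frame = np.full((64, 64, 3), background, dtype=np.uint8)
    frame[32, int(x) % 64] = 255              # draw the agent as one bright pixel
    return frame

def policy(frame):
    """A pixel policy that learned a shortcut: act from overall brightness."""
    return 1 if frame.mean() < 128 else 0

x = 0.0
for background in (0, 200):                   # train look vs. eval look
    _, reward = hidden_step(x, policy(render(x, background)))
    print(background, reward)                 # 0 -> 1.0, 200 -> 0.0
```

The rules and the hidden state are identical in both runs; only the rendering differs, yet the reward collapses, which is exactly the kind of failure the benchmark isolates.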

The world before this paper looked like this: RL agents trained on pixels often looked great on the training visuals but stumbled badly when the visuals changed a bit—new textures, different backgrounds, altered lighting, or moving distractions—despite identical dynamics and rewards. Researchers tried better representations and augmentations, but evaluations were slow and messy. Benchmarks mixed many shifts at once, so if an agent failed, we couldn’t tell whether backgrounds, lighting, or layout were to blame. Also, many simulators were CPU-bound and slow, making broad sweeps over visual parameters impractical.

The gap: We needed a benchmark that (1) changes only visuals (not rules), (2) toggles one visual factor at a time, (3) runs very fast for large studies, and (4) reports metrics that reveal partial progress versus true success. That’s exactly what KAGE-Env and KAGE-Bench deliver.

02 Core Idea

šŸž Imagine a music mixer with separate sliders for bass, treble, and vocals. If the sound gets worse when you only slide the treble, you know treble is the culprit.

🄬 The Concept: Known-Axis Visual Generalization

  • What it is: Change one visual factor (axis) at a time—like background, filters, lighting—while keeping the hidden game rules fixed, to see exactly which visual change breaks the agent.
  • How it works:
    1. Split visuals into independent axes (backgrounds, sprites, distractors, filters, effects, layout).
    2. Create a train–eval pair that differs in exactly one axis.
    3. Train a pixel-based policy on train; evaluate on eval.
    4. Any performance gap must be from that axis.
  • Why it matters: Clean cause-and-effect lets us target real fixes (e.g., better augmentation for lighting) instead of guessing. šŸž Anchor: If a student only struggles when the lighting is dim, you adjust the lighting, not the math textbook.
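As a concrete illustration (with hypothetical field names, not the benchmark’s real schema), a train–eval pair can be written as two configs that differ in exactly one key:

```python
# Hypothetical config fields for illustration only.
train_cfg = {
    "background": "black",
    "agent_sprite": "default",
    "filters": {"hue_shift": 0, "brightness": 1.0},
    "distractors": 0,
}

eval_cfg = {**train_cfg, "background": "purple"}   # flip exactly one axis

changed = {k for k in train_cfg if train_cfg[k] != eval_cfg[k]}
assert changed == {"background"}, "the pair must isolate a single axis"
```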

šŸž You know how a cartoon scene can be drawn in many styles (noir, neon, vintage), but the story’s plot stays the same? That’s the trick here.

🄬 The Concept: KAGE-Env

  • What it is: A JAX-native 2D platformer that factorizes the observation (rendering) process into independently controllable visual axes while keeping dynamics and rewards fixed.
  • How it works:
    1. Define the hidden game (physics, rewards) once.
    2. Expose 93+ visual parameters via a YAML config.
    3. Render pixels from hidden states using those parameters.
    4. Use JAX (jit, vmap, scan) to batch and accelerate everything on one GPU/TPU.
  • Why it matters: It gives a clean lab to test visual changes fast, without touching the task itself. šŸž Anchor: One platformer level, many art styles, identical jump physics.

šŸž Think of a lab test kit with labeled vials. You test each vial separately so you know exactly what caused the reaction.

🄬 The Concept: KAGE-Bench

  • What it is: A benchmark built on KAGE-Env with six axis suites and 34 train–eval pairs that each isolate a single visual shift.
  • How it works:
    1. Curate pairs where only one axis differs.
    2. Train with several seeds using a standard PPO-CNN.
    3. Evaluate on both train and eval configs.
    4. Report distance, progress, success, and return, plus gaps.
  • Why it matters: It standardizes fair, axis-specific testing so results are comparable and diagnostic. šŸž Anchor: A test set where only the background changes between two versions of the same level.

šŸž Imagine translating a muffled announcement into clear words: you remove the speaker’s voice style and focus on the meaning.

🄬 The Concept: Induced State Policy (Key Insight)

  • What it is: When visuals change, a fixed pixel policy plus the new renderer produce a different action distribution per hidden state—the induced state policy.
  • How it works:
    1. Hidden state → image (by renderer choices).
    2. Image → action (by the same pixel policy).
    3. Average over images to get action distribution given state.
  • Why it matters: This proves any performance gap comes strictly from visuals changing the policy’s effective behavior, since the task rules didn’t change. šŸž Anchor: Same crossroads, different weather filters on the camera; the driver’s choices change only because the camera view changed.
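In symbols (our own simplified notation, not necessarily the paper’s), the induced state policy marginalizes the fixed pixel policy over the renderer’s observation kernel:

```latex
% \pi is the fixed pixel policy; O_\theta(o \mid s) is the observation
% kernel (renderer) under visual configuration \theta.
\[
  \tilde{\pi}_\theta(a \mid s) \;=\; \int \pi(a \mid o)\, O_\theta(o \mid s)\, \mathrm{d}o .
\]
% Changing only the visuals, \theta \to \theta', keeps dynamics and reward
% fixed but replaces \tilde{\pi}_\theta with \tilde{\pi}_{\theta'}, so any
% performance gap must come from this induced policy shift.
```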

šŸž You don’t judge a marathoner only by total points—they might get points for showing up! You check distance and whether they finished.

🄬 The Concept: Trajectory-Level Metrics

  • What it is: Besides return, measure distance traveled, normalized progress, and success (task completion) to see partial vs full competence.
  • How it works:
    1. Track horizontal position over time.
    2. Compute total distance and progress toward the goal.
    3. Mark success if the finish line is reached.
  • Why it matters: Return can hide failures; these metrics reveal if agents move but still can’t complete tasks under new visuals. šŸž Anchor: A runner covers 90% of the course (good progress) but never crosses the finish (no success).
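A minimal sketch of these metrics from a log of horizontal positions is shown below; the exact conventions (summing absolute displacements for distance, clipping progress to [0, 1]) are assumptions of this sketch, not the paper’s code.

```python
import numpy as np

def trajectory_metrics(xs, goal_x):
    """xs: horizontal position at each step; goal_x: distance to the finish line."""
    xs = np.asarray(xs, dtype=float)
    distance = np.abs(np.diff(xs)).sum()                   # total ground covered
    progress = float(np.clip((xs.max() - xs[0]) / goal_x, 0.0, 1.0))
    success = float((xs.max() - xs[0]) >= goal_x)          # reached the finish line or not
    return {"distance": distance, "progress": progress, "success": success}

print(trajectory_metrics([0, 3, 6, 8, 8, 9], goal_x=10.0))
# Decent distance and 0.9 progress, yet success == 0.0 -- the gap return can hide.
```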

šŸž Think of moving from a hand mixer to a stand mixer—same recipe, but way faster and more consistent.

🄬 The Concept: JAX Vectorization (jit, vmap, scan)

  • What it is: Compile and batch environment steps on accelerators to simulate tens of thousands of environments in parallel.
  • How it works:
    1. Write pure JAX functions for reset/step.
    2. vmap them over many envs.
    3. jit compile for fused, fast execution; use scan for loops.
  • Why it matters: Up to ~33M steps/sec on one GPU makes exhaustive, reproducible visual sweeps practical. šŸž Anchor: Testing 65k copies of the same level at once, each with a different background.

Before vs After:

  • Before: Messy, slow evaluations where changes were tangled and causes unclear.
  • After: A fast, clean, axis-by-axis benchmark that nails down which visual knobs break agents and by how much.

Why it works: By freezing the latent control problem and touching only the renderer, and by showing that a visual shift is equivalent to a shift in the induced state policy, KAGE-Bench ensures any gap reflects perception issues—not dynamics or rewards. Add scalable simulation and richer metrics, and the diagnosis becomes both sharp and fast.

03 Methodology

At a high level: Input (YAML visual config + fixed hidden task) → KAGE-Env renders pixels → PPO-CNN trains on pixels → Evaluate on matched train–eval pairs that differ in one axis → Output metrics (distance, progress, success, return) and gaps.

šŸž Think of filling out a form to choose a costume for a play: same script, different costumes.

🄬 The Concept: YAML Configuration

  • What it is: A single file that sets all visual parameters (backgrounds, sprites, filters, effects, distractors, layout) while leaving hidden rules fixed.
  • How it works:
    1. Choose assets (e.g., 128 backgrounds, 27 sprite skins, shape colors).
    2. Set photometric knobs (brightness, contrast, hue, saturation, gamma, blur, noise, pixelation, vignette).
    3. Set lighting/overlay effects (point lights: count, intensity, radius, color; radial light strength).
    4. Optional layout tweaks; in KAGE-Bench axis pairs, only one visual axis changes.
  • Why it matters: One source of truth for visuals makes axis-isolated experiments simple and reproducible. šŸž Anchor: Flip ā€œbackground: blackā€ to ā€œbackground: noiseā€ and everything else stays the same.
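For illustration, a config in this spirit might look like the YAML string below, parsed here with PyYAML; the field names are invented for this sketch and are not KAGE-Env’s real schema.

```python
import yaml  # PyYAML

train_yaml = """
background: black
agent_sprite: default
filters:
  brightness: 1.0
  hue_shift: 0
effects:
  point_lights: 0
distractors:
  count: 0
"""

train_cfg = yaml.safe_load(train_yaml)
# Flipping a single field produces the eval config; physics, rewards, and
# the level itself stay untouched.
eval_cfg = {**train_cfg, "background": "noise"}
```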

šŸž Imagine running the same game on 65,536 screens at once, each with a different wallpaper.

🄬 The Concept: JAX Vectorization (jit, vmap, scan) in Practice

  • What it is: Run reset/step for huge batches in parallel on GPU/TPU.
  • How it works:
    1. Define env.reset and env.step as pure JAX functions.
    2. vmap over N environments to batch them.
    3. jit compile to fuse ops and eliminate Python overhead.
    4. Use scan to unroll time steps efficiently.
  • Why it matters: Allows up to ~33M steps/sec and quick sweeps over many visual configs. šŸž Anchor: A factory line where thousands of identical toys are tested at once.
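The sketch below shows the jit/vmap/scan pattern on a trivial one-dimensional ā€œmove rightā€ environment, not KAGE-Env’s real step function, just to make the batching structure concrete.

```python
import jax
import jax.numpy as jnp

def step(state, action):
    """1. A pure function: one environment, one step (toy dynamics)."""
    next_state = state + action
    reward = action                            # reward forward motion
    return next_state, reward

batched_step = jax.vmap(step)                  # 2. vmap over N environments

@jax.jit                                       # 3. jit-compile the whole rollout
def rollout(states, actions):                  # actions: [T, N]
    def body(carry, actions_t):
        next_states, rewards = batched_step(carry, actions_t)
        return next_states, rewards
    return jax.lax.scan(body, states, actions) # 4. scan over the T time steps

N, T = 65_536, 128                             # many envs, many steps at once
states = jnp.zeros(N)
actions = jnp.ones((T, N))
final_states, rewards = rollout(states, actions)
print(final_states.shape, rewards.shape)       # (65536,) (128, 65536)
```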

šŸž Think of a coach that watches one frame at a time and suggests the next move, learning what works best over many games.

🄬 The Concept: PPO-CNN Baseline

  • What it is: A standard setup combining PPO (a stable policy gradient method) with a CNN (good at images) to map pixels to actions.
  • How it works:
    1. CNN encodes the image into features.
    2. Policy head outputs action probabilities; value head estimates returns.
    3. PPO updates the policy with clipped gradients to avoid wild swings.
    4. Repeat over many steps, with periodic evaluation.
  • Why it matters: A strong, commonly used baseline makes results comparable to prior work. šŸž Anchor: A tried-and-true recipe for training a pixel game agent.
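The ā€œavoid wild swingsā€ step is PPO’s clipped surrogate objective. Below is a minimal JAX sketch of that loss; the clipping coefficient and the toy inputs are illustrative, and the full baseline would also add a value loss and an entropy bonus.

```python
import jax.numpy as jnp

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped policy-gradient loss over a batch of transitions."""
    ratio = jnp.exp(logp_new - logp_old)                        # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = jnp.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -jnp.mean(jnp.minimum(unclipped, clipped))           # maximize the surrogate

loss = ppo_clip_loss(
    logp_new=jnp.array([-0.9, -1.2]),
    logp_old=jnp.array([-1.0, -1.0]),
    advantages=jnp.array([1.5, -0.5]),
)
print(loss)
```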

Step-by-step recipe for KAGE-Bench:

  1. Fix the latent control problem.
  • Hidden physics (e.g., gravity, jump), rewards, and level horizon T are constant.
  • Reward encourages first-time forward progress and penalizes time, jumping, and idling.
  • Why: Keeps the decision-making task identical across visuals.
  • Example: The finish line distance D is the same no matter the background.
  2. Choose an axis and create a train–eval pair.
  • Pick one visual axis (e.g., background).
  • Set train config (e.g., black background); set eval config (e.g., purple background) while keeping everything else fixed.
  • Why: Guarantees any performance gap is caused by that axis.
  • Example: Agent sprite, physics, and rewards remain unchanged.
  3. Train PPO-CNN on the train config.
  • Run multiple seeds (e.g., 10) to average out randomness.
  • Log metrics periodically on both train and eval configs.
  • Why: RL can be noisy; multiple seeds give stable conclusions.
  • Example: Every 100k steps, test on black (train) and purple (eval).
  4. Aggregate with the ā€œmaximum-over-trainingā€ protocol (see the sketch after this recipe).
  • For each seed and config, take the best (max) metric achieved during training.
  • Average those maxima over seeds, then over pairs within the axis.
  • Why: Generalization can peak at different times; this avoids cherry-picking.
  • Example: If best eval success appeared at 1.4M steps for seed 3, that’s the number used for that seed.
  5. Report trajectory-level metrics plus return.
  • Distance: how far the agent traveled.
  • Progress: distance normalized by goal distance.
  • Success: 1 if the agent reached the goal; else 0.
  • Return: shaped reward total.
  • Why: Return can mask completion failures; success reveals them.
  • Example: An agent may keep moving but miss jumps under harsh lighting.
  6. Scale and sweep.
  • Use JAX speed to sweep across many parameter values per axis (e.g., hue shift angles, light counts, distractor counts).
  • Why: Dose–response curves show how performance drops as visuals change more.
  • Example: Background colors added cumulatively: black → black+white → +red → +green → +blue.
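Here is the aggregation sketch referenced in step 4, assuming each metric is logged as a seeds-by-checkpoints array for every train–eval pair; the array shapes and the per-axis averaging are assumptions of this sketch.

```python
import numpy as np

def max_over_training(metric):                 # metric: [n_seeds, n_checkpoints]
    best_per_seed = metric.max(axis=1)         # best checkpoint reached by each seed
    return float(best_per_seed.mean())         # average those maxima over seeds

rng = np.random.default_rng(0)
pair_scores = [max_over_training(rng.uniform(size=(10, 20)))
               for _ in range(5)]              # e.g. 5 pairs in one axis suite
axis_score = float(np.mean(pair_scores))       # then average over pairs in the axis
print(axis_score)
```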

Secret sauce

  • Axis isolation: Only one visual knob changes; dynamics/rewards are invariant.
  • Formal reduction: Visual shift = induced policy shift, so failures are strictly due to perception.
  • Vectorized speed: High throughput enables comprehensive, robust evaluation.
  • Rich metrics: Distance/progress/success disentangle motion from completion. Together, these make KAGE-Bench precise, fast, and revealing.

04 Experiments & Results

šŸž Suppose two students run the same obstacle course. One course is filmed in daylight (train), the other at night (eval). If the runner slows only at night, we know the lighting is the issue.

🄬 The Concept: Generalization Gap

  • What it is: The difference between performance on the train visuals and the eval visuals for the same task.
  • How it works:
    1. Train PPO-CNN on the train config.
    2. Evaluate on both train and eval configs.
    3. Subtract eval from train to get the gap for each metric.
  • Why it matters: A large gap means the agent didn’t truly learn the task—it learned the training look. šŸž Anchor: A spelling whiz who only aces tests with the teacher’s favorite font.
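As a tiny sketch of that subtraction, the relative-drop form below is consistent with the Ī”SR percentages quoted in the scoreboard, though the exact convention is an assumption of this sketch rather than the paper’s definition.

```python
def generalization_gap(train_metric, eval_metric):
    absolute = train_metric - eval_metric
    relative = absolute / train_metric if train_metric > 0 else 0.0
    return absolute, relative

abs_gap, rel_gap = generalization_gap(train_metric=0.83, eval_metric=0.11)
print(f"{abs_gap:.2f} absolute, {rel_gap:.0%} relative drop")   # ~0.72 absolute, ~87% relative
```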

The test

  • What they measured: Distance, progress, success rate (SR), and return, each summarized via the maximum-over-training approach across 10 seeds.
  • Why these metrics: Success highlights completion; distance/progress show partial competence; return alone can mislead.

The competition

  • Baseline: A standard PPO-CNN (from a CleanRL-style implementation) trained on single RGB frames.
  • Comparisons: Not to other algorithms, but across visual axes and difficulty sweeps to see which axes are most brittle.

The scoreboard—with context

  • Filters (photometric changes like brightness/contrast/hue) and Effects (lighting like point lights, radial light): biggest trouble.
    • Filters: success plummets from ~0.83 to ~0.11 (āˆ†SR ā‰ˆ 86.8%). That’s like going from an A to a near-zero.
    • Effects: success drops from ~0.82 to ~0.16 (āˆ†SR ā‰ˆ 80.5%). Again, severe.
    • Context: Distance drops only modestly (ā‰ˆ12–21%), so agents still move but fail to finish—return can look ā€œokayā€ while success collapses.
  • Background changes: also harmful.
    • Success falls from ~0.90 to ~0.42 (āˆ†SR ā‰ˆ 53.3%). Like going from reliably finishing to finishing less than half the time.
    • Distance/progress drop by ~30%. Adding more colors to the background steadily worsens eval success (clear dose–response).
  • Layout and Distractors: medium damage.
    • Layout: āˆ†SR ā‰ˆ 62.8% despite small distance gaps (~4%). Agents move but can’t complete.
    • Distractors: āˆ†SR ā‰ˆ 30.9%, again with small distance gaps. More lookalike distractors steadily suppress eval success.
  • Agent appearance: comparatively benign.
    • āˆ†SR ā‰ˆ 21.1%. Changing the avatar’s look matters less than messing with the world’s look.

Surprising findings

  • Motion without mastery: Several shifts preserve forward motion (distance) while breaking completion (success). Return alone hides this—hence the need for trajectory metrics.
  • Dose–response clarity: Gradually adding background colors or increasing distractor counts shows smooth, monotonic drops in eval success, while train success stays high.
  • Diversity can help: Training with more varied backgrounds reduced some gaps in background tests, hinting that targeted augmentations might be effective.

Concrete examples

  • Black → noise background: near-total failure (āˆ†SR ~ 99%).
  • Hue shift by 180°: extreme collapse (āˆ†SR ~ 98.8%).
  • Light count 4: very large failure (āˆ†SR ~ 95.5%).
  • 7 same-as-agent distractors: major drop (āˆ†SR ~ 92.0%).

Takeaway

  • Visual robustness is axis-dependent. PPO-CNN is strong in-distribution but brittle under controlled visual shifts—especially photometric and lighting changes and busy backgrounds. Reporting distance, progress, and success exposed failures that return alone would have hidden.

05 Discussion & Limitations

šŸž If you only ever practice basketball on one court, you might miss how different lighting or floor color can throw you off in a new gym.

🄬 The Concept: When Not to Trust Return Alone

  • What it is: Return is a single score that can mix many effects and hide failures to actually complete the task.
  • How it works:
    1. Agents get shaped rewards for partial progress.
    2. Even if they never finish, they can still score points.
    3. Return might look fine while success is near zero.
  • Why it matters: You could wrongly think your agent generalized when it only learned to shuffle forward. šŸž Anchor: A runner racks up steps but never crosses the finish line—distance is up, success is not.

Limitations

  • Scope: Focused on visual generalization with known axes in a 2D platformer. Real robots may face 3D geometry, occlusions, and sensor quirks beyond these axes.
  • Pixel-only policies: Results center on reactive pixel policies; memory or state-estimation methods might behave differently.
  • Single-task domain: Although ideal for diagnosis, it does not cover all task families (e.g., manipulation, driving) yet.
  • Asset coverage: While assets are diverse (backgrounds, sprites), they’re still curated; real-world data is more chaotic.

Required resources

  • Hardware: To enjoy the full speed (millions of steps/sec), you need a modern GPU/TPU. It still runs on CPU, but slower.
  • Software: Familiarity with the JAX stack (jit, vmap, scan) helps you get reproducible, vectorized runs.

When not to use

  • If you need coupled changes (e.g., visuals plus layout plus physics) by design, KAGE-Bench’s isolation may be too strict for your purpose.
  • If your agent uses privileged state (not pixels), the visual axes won’t stress it meaningfully.

Open questions

  • How do memory, attention, or object-centric models change axis sensitivity?
  • Which targeted augmentations best close gaps per axis (e.g., hue jitter for filters, glare augmentation for effects)?
  • Can similar known-axis ideas scale to 3D, partial occlusions, or real robot cameras?
  • What’s the best checkpointing/evaluation strategy beyond maximum-over-training for fair, stable comparisons across methods?
  • How do we balance diversity (to help generalization) with train-distribution realism to avoid over-randomization?

06 Conclusion & Future Work

Three-sentence summary

  • KAGE-Env cleanly separates visuals from the hidden task, and KAGE-Bench uses it to change one visual axis at a time, creating 34 precise train–eval pairs.
  • A standard PPO-CNN shows strong axis-dependent brittleness: backgrounds, photometric filters, and lighting often crush success while agent appearance shifts are milder.
  • Fast JAX vectorization enables up to ~33M steps/sec, supporting large, reproducible sweeps and richer metrics (distance, progress, success) that reveal failures return would hide.

Main achievement

  • Turning messy visual robustness testing into a clean, fast, and formally justified diagnosis tool by proving that visual shifts reduce to induced state-policy shifts and by isolating axes by construction.

Future directions

  • Expand to broader tasks (manipulation, navigation, driving) and 3D scenes; explore memory and object-centric models; design axis-targeted augmentations and curricula; standardize evaluation beyond maximum-over-training.

Why remember this

  • KAGE-Bench shows exactly which visual knobs break pixel-based RL and how badly, at unprecedented speed. It makes robust perception less guesswork and more science, helping the community build agents that keep their cool when the picture changes but the task does not.

Practical Applications

  • Benchmark new RL architectures for visual robustness by testing axis-by-axis where they fail or succeed.
  • Design targeted data augmentations (e.g., hue jitter for filter sensitivity) and validate their impact quickly.
  • Build curricula that gradually increase visual difficulty (dose–response schedules) and measure transfer.
  • Evaluate how memory or attention modules change sensitivity to distractors, lighting, or backgrounds.
  • Stress-test agents before deployment by simulating camera noise, pixelation, or lighting glare conditions.
  • Run hyperparameter sweeps at scale to find stable training settings that generalize under known axes.
  • Teach students visual generalization concepts with a fast, easy-to-configure platformer and clear metrics.
  • Compare methods fairly using standardized suites and metrics that separate motion from completion.
  • Prototype perception modules (e.g., object-centric encoders) and measure which axes they improve.
  • Automate regression tests: lock in visual robustness targets and flag performance drops in CI pipelines.
Tags: reinforcement learning Ā· visual generalization Ā· KAGE-Env Ā· KAGE-Bench Ā· JAX Ā· PPO-CNN Ā· POMDP Ā· known-axis Ā· observation kernel Ā· photometric filters Ā· lighting effects Ā· distractors Ā· background variation Ā· vectorized simulator Ā· induced state policy