VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
Key Summary
- VisGym is a playground of 17 very different visual tasks for testing and training AI models that see and read (Vision–Language Models) as they act over many steps.
- It standardizes actions as function calls (like 'move', 'swap', 'rotate') and gives clear instructions and optional text feedback so models know what happened after each move.
- Across 12 leading models, even the best scored 46.61% on easy and 26.00% on hard tasks, showing multi-step visual interaction is still very tough.
- Long chat histories can hurt: performance follows an inverted-U shape, peaking with just a few past steps (about four) and dropping with unbounded history.
- Turning images into ASCII text often makes symbolic tasks much easier for models, revealing that perception, not reasoning, is the bottleneck in many cases.
- Removing textual feedback drops performance, meaning today’s models struggle to understand action results from images alone.
- Showing the final goal image at the start usually helps, but can backfire when models misjudge small visual differences (e.g., Zoom-In Puzzle).
- Fine-tuning on solver-made, multi-step demonstrations lifts performance a lot, especially when the demos reveal hidden state or unknown dynamics.
- Both the vision encoder and the language model matter; history integration via the LLM is often the larger bottleneck, while fine-grained vision helps on perception-heavy tasks.
- VisGym gives concrete, controlled knobs (difficulty, history, feedback, modality) to diagnose failures like action looping, poor memory, and early stopping, and to chart clear paths for improvement.
Why This Research Matters
VisGym brings science-lab control to multimodal agents, letting us flip one switch at a time to see how perception, memory, and action truly work together. This will help build home robots that can actually finish chores step by step instead of getting stuck. It guides smarter app assistants that can navigate screens, fix mistakes, and adapt as interfaces change. In education, multi-step visual tutors can watch students’ steps and offer targeted hints, improving learning-by-doing. In healthcare and logistics, agents that track visual changes over time can reduce errors in procedures or warehouse operations. By turning vague failures into measurable dials—and showing how curated demonstrations fix them—VisGym speeds the path toward dependable, real-world AI helpers.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how playing a new board game is hardest the first time because you must look at the board, remember past moves, and plan ahead? Now imagine doing that with your eyes and brain tied together so every move depends on what you see and remember.
🥬 Filling (The Actual Concept)
- What it is: Vision–Language Models (VLMs) are AIs that look at pictures or video and read/write text to decide what to do next.
- How it works:
- They see an image or frame.
- They read instructions.
- They write a text action like 'move left' or 'swap pieces'.
- They see what changed and repeat.
- Why it matters: Without tying sight and language to action over time, models can answer single questions about images but fall apart when tasks need many steps. 🍞 Bottom Bread (Anchor) A VLM can point at a dog in one photo, but sorting a shuffled video back into order step by step is much harder.
🍞 Top Bread (Hook) Imagine following a recipe: crack eggs, then whisk, then cook—each step depends on what just happened in the pan.
🥬 The Concept: Multi-step decision-making is making a chain of correct moves where each new move depends on what you observed before.
- How it works:
- Observe current state (picture or frame).
- Remember past moves and feedback.
- Choose the next action.
- Repeat until the goal is reached.
- Why it matters: If a model forgets past steps or can’t read the scene, it repeats mistakes or quits early. 🍞 Anchor In a maze, you must remember dead ends to avoid looping.
🍞 Top Bread (Hook) You know how a good school test checks math, reading, and science—not just one subject?
🥬 The Concept: Cross-domain evaluation tests models on many different kinds of tasks so we know what they can really do.
- How it works:
- Create a mix of puzzles, navigation, manipulation, and real-image tasks.
- Keep scoring rules consistent.
- Compare models fairly across all tasks.
- Why it matters: A model great at one game might be bad at another; only a varied test shows true strengths and weaknesses. 🍞 Anchor A student who aces only spelling but fails maps isn’t ready for geography class.
🍞 Top Bread (Hook) Think about telling a story to help your friend remember what’s going on in a long game.
🥬 The Concept: Context length (history) is how many past turns the model can see when choosing its next move.
- How it works:
- Keep recent observations, actions, and feedback.
- Let the model read this history.
- Truncate if it gets too long and noisy.
- Why it matters: Too little history makes the model forget; too much clutters its memory. There’s a sweet spot. 🍞 Anchor Having just the last few messages in a chat helps, but pasting the whole year’s chat log can confuse you.
🍞 Top Bread (Hook) Imagine two ways to show a map: a picture, or a grid drawn with text characters.
🥬 The Concept: Observation modality is whether the model gets images or text (like ASCII) to describe the scene.
- How it works:
- Provide either an image or an ASCII layout.
- Keep actions and goals the same.
- Compare performance.
- Why it matters: If text versions are easier, the problem might be vision (perception), not thinking (reasoning). 🍞 Anchor Solving a sliding puzzle from a clear text grid might be easier than from a blurry photo.
🍞 Top Bread (Hook) When you play a game, the rulebook says which moves are allowed.
🥬 The Concept: Action space is the set of moves the model is allowed to make.
- How it works:
- Define named actions (like 'rotate', 'swap', 'move') with parameters.
- The model outputs a function call with those parameters.
- The environment executes it and reports what happened.
- Why it matters: Clear actions prevent confusion and let the model reuse skills across tasks. 🍞 Anchor In Jigsaw, 'swap((0,0),(1,1))' trades two pieces; in Maze, 'move(up)' steps one cell.
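To make this concrete, here is a minimal Python sketch of how a task might declare its menu of actions. The `ActionSpec` class and the exact usage strings are illustrative assumptions, not VisGym's actual API.

```python
from dataclasses import dataclass

@dataclass
class ActionSpec:
    name: str    # the function name the model may call, e.g. "swap"
    usage: str   # strict usage text shown to the model in the prompt

# Two illustrative menus: the model speaks the same "function call"
# language in both tasks; only the names and arguments change.
JIGSAW_ACTIONS = [
    ActionSpec("swap", "swap((r1,c1),(r2,c2)): exchange the pieces at two cells"),
    ActionSpec("stop", "stop(): declare the puzzle solved"),
]
MAZE_ACTIONS = [
    ActionSpec("move", "move(d): step one cell, d in {up,down,left,right}"),
    ActionSpec("stop", "stop(): declare the goal reached"),
]
```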
🍞 Top Bread (Hook) Think of a coach who says, 'Nice try, but that move was out of bounds.'
🥬 The Concept: Environment feedback is short text the environment returns to explain what an action did.
- How it works:
- After each action, send a message like 'executed', 'out of bounds', or 'invalid format'.
- The model reads this along with the next image.
- The model adjusts its plan.
- Why it matters: Without feedback, models often can’t tell why something failed just from pixels. 🍞 Anchor When hitting a maze wall, 'blocked by wall' helps you turn instead of trying forward again.
🍞 Top Bread (Hook) Imagine trying to solve a mystery without seeing the whole scene—some clues are hidden.
🥬 The Concept: Partial observability means the model can’t see the full state at once and must remember or explore to uncover hidden parts.
- How it works:
- Only part of the world is visible each step (like a first-person maze).
- Actions change what becomes visible.
- Memory and exploration reveal the rest.
- Why it matters: Without handling hidden information, the model gets lost or loops. 🍞 Anchor In 3D mazes, you only see the hallway ahead; you must remember turns you took.
🍞 Top Bread (Hook) Before this work, people had many single games but no fair, tunable playground to test multi-step visual thinking across many domains.
🥬 The Gap Filled
- What it is: VisGym is a single, customizable 'gym' of 17 long-horizon, visual, interactive tasks with consistent interfaces.
- How it works:
- Standardize actions as function calls with instructions.
- Provide visual or text observations and optional feedback.
- Include oracle solvers to generate multi-step demonstrations for training.
- Expose knobs for difficulty, history length, and goal visibility.
- Why it matters: Now we can systematically test what helps or breaks models and train them with the exact experiences they need. 🍞 Anchor You can flip the 'feedback' switch off to test whether a model learns from images alone, or set history to 4 steps to see if that helps memory.
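As a sketch of what these switches could look like in code (the field names are assumptions, not the framework's real option names):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EpisodeConfig:
    difficulty: str = "easy"           # e.g. 2x2 vs 3x3 Jigsaw
    history_turns: Optional[int] = 4   # None = unbounded history
    observation: str = "image"         # "image" or "ascii"
    text_feedback: bool = True         # emit "blocked by wall" etc.
    show_goal: bool = False            # include the target image up front
    max_steps: int = 30                # step cap before failure
```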
02 Core Idea
🍞 Top Bread (Hook) Imagine a science lab where you can change one dial at a time—light, heat, or ingredients—to discover exactly what makes the experiment succeed.
🥬 The Concept: The Aha! is to build one unified, controllable gym where we can both test and train vision–language agents on multi-step visual tasks while toggling critical factors like history length, observation type, feedback, goals, and difficulty.
- How it works:
- Represent actions as function calls with parameters, so models use the same 'language' across tasks.
- Provide clear function instructions and optional textual feedback after each step.
- Supply oracle solvers that create multi-step demonstration trajectories for supervised fine-tuning.
- Offer switches: image vs ASCII, short vs long history, feedback on vs off, goal provided vs not, easy vs hard.
- Why it matters: With knobs to control each ingredient, we can pinpoint why models fail and how to fix them, not just that they failed. 🍞 Anchor Turn on 'final goal shown' for Jigsaw to test pure perception-and-tool use; turn it off to test planning to discover the goal.
🍞 Top Bread (Hook) Think of learning a sport: a coach tells you the legal moves, watches your play, and explains what went wrong so you improve.
🥬 The Concept: Function-conditioned action space unifies actions as named function calls like 'move(d)' or 'swap(i,j)' so language models can call tools naturally.
- How it works:
- Each environment advertises its menu of functions and valid argument ranges in text.
- The model outputs a function name and parameters.
- A step function parses, validates, executes, and returns visual and text feedback.
- Why it matters: This lets one model transfer strategy across many tasks without learning a new button layout each time. 🍞 Anchor From Maze to Jigsaw, the model still speaks in 'function calls'—only the function names and arguments change.
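A minimal sketch of the parse-and-validate half of this loop, assuming calls arrive as plain strings like 'move(up)' or 'swap((0,0),(1,1))'; the regex and the error strings are illustrative.

```python
import re

CALL_RE = re.compile(r"^(?P<name>\w+)\((?P<args>.*)\)$")

def parse_call(text: str, allowed: set):
    """Return ((name, args), status), or (None, reason) if the call is rejected."""
    m = CALL_RE.match(text.strip())
    if m is None:
        return None, "invalid format"   # not a function call at all
    name, args = m.group("name"), m.group("args")
    if name not in allowed:
        return None, "invalid action"   # not on this task's menu
    return (name, args), "ok"

print(parse_call("move(up)", {"move", "stop"}))  # (('move', 'up'), 'ok')
print(parse_call("fly(3)", {"move", "stop"}))    # (None, 'invalid action')
```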
🍞 Top Bread (Hook) When instructions are clear, you play better.
🥬 The Concept: Function instructions are plain-language tool manuals describing each action and its parameter formats.
- How it works:
- Include short, strict usage texts in the initial prompt.
- Specify examples and argument rules.
- Reject malformed actions and explain why.
- Why it matters: Models stop guessing and start using tools correctly, reducing invalid moves. 🍞 Anchor 'rotate([dy,dp,dr])' means yaw, pitch, roll in that order, so the model doesn’t mix axes.
🍞 Top Bread (Hook) A replay coach who can perform perfect plays gives you the exact drills you need.
🥬 The Concept: Oracle multi-step solvers generate structured demonstrations that show how to solve tasks step by step.
- How it works:
- For each task, a heuristic or search-based solver completes episodes.
- It can vary strategies or add reversible padding moves for diversity.
- These trajectories become training data for supervised fine-tuning (SFT).
- Why it matters: Models learn much faster from clean, multi-step examples than from random exploration. 🍞 Anchor In Matchstick Equation, BFS finds the shortest fix; DFS logs exploratory detours—both become teachable demos.
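The BFS variety can be sketched generically; this is a textbook breadth-first search over abstract states, not the paper's exact solver.

```python
from collections import deque

def bfs_solve(start, is_goal, neighbors):
    """Return the shortest action sequence from start to a goal state.

    `neighbors(state)` must yield (action, next_state) pairs, and states
    must be hashable. The returned list is a ready-made demonstration.
    """
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, actions = queue.popleft()
        if is_goal(state):
            return actions
        for action, nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return None  # no solution within the searched space
```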
🍞 Top Bread (Hook) Sometimes seeing the answer first turns a puzzle into a matching game.
🥬 The Concept: Goal observation control lets us optionally show the final target image at the start to test perception vs planning.
- How it works:
- Provide the exact goal observation in the instructions.
- The model aligns the current state to that target.
- Compare with runs where the goal is hidden and must be constructed.
- Why it matters: If performance jumps, perception-and-tool use is the bottleneck; if not, reasoning or action calling may be. 🍞 Anchor In Patch Reassembly, knowing the target picture helps you place patches correctly faster.
🍞 Top Bread (Hook) Carrying a huge backpack of notes can slow you down; keeping just the last few pages can help you focus.
🥬 The Concept: History truncation studies how many past turns to keep before the next action.
- How it works:
- Keep 1, 2, 4, or all prior turns.
- Measure performance changes.
- Find the sweet spot where info helps but clutter doesn’t hurt.
- Why it matters: Today’s models often do best with just a few recent steps instead of the full story. 🍞 Anchor In Sliding Block, four turns of memory can help, but full chat history can distract and reduce success.
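In code, the knob is nearly a one-liner; the turn structure here is an assumption.

```python
def truncate_history(turns, k=None):
    """Keep only the most recent k (observation, action, feedback) turns.

    k=None keeps the full, unbounded history; the paper finds a small
    window (around 4) usually works best.
    """
    return turns if k is None else turns[-k:]

# Example: from ten past turns, the model sees only the last four.
recent = truncate_history([f"turn {i}" for i in range(10)], k=4)
```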
🍞 Top Bread (Hook) Reading a map vs seeing a photo—both describe the world, but in different ways.
🥬 The Concept: Observation modality toggles between images and ASCII text for the same task.
- How it works:
- Render the same state as a picture or as a text grid.
- Keep actions and rewards identical.
- Compare outcomes.
- Why it matters: Big gains on ASCII mean perception is the choke point; little change means reasoning is the limit. 🍞 Anchor GPT-5 often does 3–4× better on text versions of symbolic tasks than on images, pointing at vision as the bottleneck.
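A sketch of rendering one maze state as text; the glyphs and the 0/1 grid encoding are illustrative choices, and the image path would rasterize the same state instead.

```python
def render_ascii(grid, agent, goal):
    """Draw a 0/1 occupancy grid as text: '#' wall, '.' open, A agent, G goal."""
    lines = []
    for r, row in enumerate(grid):
        line = ""
        for c, cell in enumerate(row):
            if (r, c) == agent:
                line += "A"
            elif (r, c) == goal:
                line += "G"
            else:
                line += "#" if cell else "."
        lines.append(line)
    return "\n".join(lines)

print(render_ascii([[0, 1, 0], [0, 0, 0]], agent=(0, 0), goal=(1, 2)))
# A#.
# ..G
```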
🍞 Top Bread (Hook) When learning to ride a bike, small 'tell me what happened' comments speed up progress.
🥬 The Concept: Text feedback ablation tests whether models can learn consequences from pixels alone.
- How it works:
- Run tasks with and without textual feedback.
- Compare success rates.
- Analyze where pure visual learning fails.
- Why it matters: Current VLMs rely heavily on text hints; removing them reveals perception-action gaps. 🍞 Anchor Without 'blocked by wall' messages, maze performance drops across models.
🍞 Top Bread (Hook) Explorers sometimes take small test steps to learn the terrain before a big move.
🥬 The Concept: Information-revealing demonstrations are curated demos that expose hidden states or unknown dynamics before solving.
- How it works:
- Insert unit moves that reveal scale or rotation effects.
- Fully rotate along axes to expose 3D geometry.
- Then execute the optimal solution.
- Why it matters: These demos teach models how the world reacts, dramatically improving SFT results. 🍞 Anchor In Matchstick Rotation, doing two unit translations before the final aligning move boosted success from 32.9% to 70.0%.
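The curation idea itself is simple to sketch: prepend probe actions that expose the unknown dynamics, then append the solver's optimal moves. The action strings below are hypothetical.

```python
def build_revealing_demo(probes, solve_actions):
    """Curated trajectory: information-revealing probes, then the solution.

    Each probe is a unit-scale or reversible action whose visual effect
    teaches the model how the world reacts before the real solving begins.
    """
    return list(probes) + list(solve_actions)

# e.g. two unit test-moves reveal the per-step scale before the final align
demo = build_revealing_demo(
    probes=["translate(1)", "translate(1)"],
    solve_actions=["translate(6)", "stop()"],
)
```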
03 Methodology
High-level recipe: Input → Instructions + Observation + Allowed Actions → Model outputs a function call → Environment step validates and executes → New image + feedback → Repeat until 'stop' or step limit, then success or failure is scored.
🍞 Top Bread (Hook) Imagine a universal game controller where each button is a labeled tool you can call by name.
🥬 The Concept: Function-conditioned action interface standardizes how agents act.
- What it is: Every move is a function call with a name and parameters, like 'swap((r1,c1),(r2,c2))' or 'rotate([dy,dp,dr])'.
- How it works:
- List available functions and argument ranges in plain text.
- The model responds with exactly one function call per turn.
- The environment parses and validates the call.
- If valid, it executes and returns a new observation plus text feedback; if not, it returns 'invalid format/action'.
- Why it matters: Consistency lets one model transfer patterns across domains and reduces action-format errors. 🍞 Bottom Bread (Anchor) In Maze 3D, the model alternates 'turn(left/right)' and 'move(0)' to follow a planned path.
Step-by-step details
- Inputs at each turn
- Observation: Image (or ASCII) of the current state.
- Instructions: Task goal and function manuals.
- History: A configurable number of past (observation, action, feedback) turns.
- Why it exists: The model needs the goal, the tools, and enough memory to plan.
- Example: 'Navigate to the red dot. Actions: move(0), turn(d), stop().' plus the current camera view.
- Model picks an action
- What happens: The model outputs a single function call string, e.g., 'move(0)' or 'swap((0,0),(1,1))'.
- Why it exists: Forces precise, structured decisions that the environment can check.
- Example: In Jigsaw, choosing 'reorder([...])' instead of many swaps.
- Unified step function (the executor) 🍞 Hook Think of a referee who checks moves, applies them, and announces the result. 🥬 The Concept: A single step routine handles parse → validate → apply → feedback.
- How it works:
- Parse the model string into (action, payload).
- Validate against the action's schema.
- Apply to the environment, update state, set termination if 'stop'.
- Return next image, reward (only at end), and text feedback.
- Why it matters: Keeps rules consistent across all 17 tasks; a code sketch of this loop follows the list below. 🍞 Anchor 'rotate([30,0,0])' on the cube updates yaw and returns 'executed'.
- Feedback channel (optional)
- What happens: The environment may send short text like 'out of bounds' or 'executed'.
- Why it exists: Helps models connect actions to consequences when pixels are ambiguous.
- Example: Sliding into an occupied cell returns 'invalid move'.
- Termination and scoring
- What happens: If the model outputs 'stop', the environment compares the current state to the goal and returns success or failure; otherwise, continue until step cap.
- Why it exists: Encourages confidence—don't stop too early, don't dawdle.
- Example: Zoom-In Puzzle requires exact order; early stopping while misordered yields failure.
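The referee loop promised above might look like this in code. The `Env` interface and the message strings are assumptions, not the framework's actual implementation, and the parse pattern is the same as in the earlier sketch.

```python
import re

CALL_RE = re.compile(r"^(?P<name>\w+)\((?P<args>.*)\)$")

def step(env, model_output, allowed):
    """Parse -> validate -> apply -> feedback, returning (image, feedback, done)."""
    m = CALL_RE.match(model_output.strip())
    if m is None:
        return env.render(), "invalid format", False
    name, args = m.group("name"), m.group("args")
    if name not in allowed:
        return env.render(), "invalid action", False
    if name == "stop":
        return env.render(), "stopped", True       # success is scored vs the goal
    feedback = env.apply(name, args)               # e.g. "executed", "blocked by wall"
    return env.render(), feedback, False
```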
The secret sauce ingredients 🍞 Hook Like a lab with adjustable knobs, tiny changes reveal big truths.
A) History truncation knob
- What it is: Keep only 1, 2, 4, or all prior turns.
- How it works: The environment prunes the chat history before each step.
- Why it matters: Shows an inverted-U: small histories help, unbounded ones often hurt.
- Anchor: In Sliding Block, 4 turns beat full history.
B) Observation modality knob
- What it is: Swap the same state between image and ASCII text.
- How it works: Render maps, boards, or equations as grids/characters.
- Why it matters: Often, ASCII boosts success, exposing perception limits.
- Anchor: GPT-5 jumps 3–4× on text versions of symbolic tasks.
C) Feedback on/off knob
- What it is: Include or remove text feedback after actions.
- How it works: The environment either emits messages or stays silent.
- Why it matters: Performance drops without feedback, showing dependence on textual cues.
- Anchor: Maze tasks suffer when 'blocked by wall' is removed.
D) Goal visibility knob
- What it is: Provide the final goal image from the start.
- How it works: Add the target image to instructions.
- Why it matters: Usually boosts performance if perception-and-tool use is the limit; sometimes backfires when models misjudge visual equality.
- Anchor: Helps in Jigsaw, hurts in Zoom-In Puzzle when the model thinks 'already identical' and stops early.
E) Difficulty knob
- What it is: Change maze size, puzzle pieces, angular tolerances, etc.
- How it works: Harder settings raise planning horizons or precision.
- Why it matters: Separates easy successes from robust capabilities.
- Anchor: Jigsaw 2×2 (easy) vs 3×3 (hard) widens gaps.
Training with VisGym 🍞 Hook Learning from a coach who shows play-by-play beats guessing.
🥬 The Concept: Supervised fine-tuning (SFT) on solver-generated trajectories teaches multi-step strategies.
- What it is: Train models on sequences from oracle solvers, then evaluate on easy and harder variants.
- How it works:
- Generate diverse, successful demos; filter failures and prevent test leakage.
- Train models (e.g., Qwen2.5-VL-7B) on single tasks or mixed tasks.
- Optionally curate 'information-revealing' steps for partial observability or unknown dynamics.
- Why it matters: SFT lifts success dramatically and shows how data curation shapes generalization. 🍞 Anchor Two unit test-moves before a final align in Matchstick Rotation doubled success relative to naive demos.
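One way such trajectories might be packaged for SFT, using a generic chat-message schema; the exact message format is an assumption, not the paper's data pipeline.

```python
def trajectory_to_sft(instructions, turns):
    """Convert one oracle episode into per-step supervised examples.

    `turns` is a list of (observation, action, feedback) triples; the model
    is supervised to reproduce the solver's action at every decision point.
    """
    messages = [{"role": "system", "content": instructions}]
    feedback = ""                        # nothing happened before move one
    examples = []
    for obs, action, next_feedback in turns:
        messages.append({"role": "user",
                         "content": [{"image": obs}, {"text": feedback}]})
        messages.append({"role": "assistant", "content": action})
        examples.append(list(messages))  # one training example per step
        feedback = next_feedback
    return examples
```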
Module-wise fine-tuning 🍞 Hook Eyes and brain both matter; which should you train more?
🥬 The Concept: Vision-vs-LLM ablations reveal where gains come from.
- What it is: Fine-tune just the vision encoder, just the LLM, or both.
- How it works: Compare gains per task and observability.
- Why it matters: LLM (temporal reasoning) often limits performance under partial observability; vision helps tasks needing fine detail. 🍞 Anchor Zoom-In Puzzle benefits more from vision; Maze 3D benefits more from LLM history integration.
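A minimal sketch of the ablation mechanics in PyTorch style; the module attribute names vary by architecture and are assumptions, not Qwen2.5-VL's guaranteed layout.

```python
def set_trainable(model, tune_vision: bool, tune_llm: bool):
    """Freeze or unfreeze the two halves independently for the ablation."""
    for p in model.visual.parameters():          # vision encoder (assumed name)
        p.requires_grad = tune_vision
    for p in model.language_model.parameters():  # LLM backbone (assumed name)
        p.requires_grad = tune_llm

# Three ablation arms:
# set_trainable(model, tune_vision=True,  tune_llm=False)  # vision only
# set_trainable(model, tune_vision=False, tune_llm=True)   # LLM only
# set_trainable(model, tune_vision=True,  tune_llm=True)   # both
```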
04 Experiments & Results
The test
- What they measured: Task success rate (did the model finish correctly within the step cap) across easy vs hard settings and across many ablations (history, modality, feedback, goal visibility). They also tracked how many steps successful runs needed and cataloged failure patterns.
- Why: To see not only how good models are, but why they fail and which toggles fix them.
The competition
- 12 state-of-the-art VLMs: proprietary (Gemini 3 Pro, Gemini 2.5 Pro, GPT-5, Claude Sonnet 4, Grok 4 Fast, Qwen-VL-Max), open-weight (Qwen3-VL-235B-Instruct, GLM-4.5V, Llama-4-Maverick, Qwen-2.5-VL-72B, Gemma 3-27B), and a GUI/game specialist (UI-TARS-1.5-7B). They also evaluated their own fine-tuned models trained on solver demos.
Scoreboard (with context)
- Overall difficulty: Even the top model reached only 46.61% on easy and 26.00% on hard settings, solving fewer than half the easy tasks and barely a quarter of the hard ones across a very mixed test.
- Specialization examples:
- GPT-5: Better at longer contexts and unknown-scale tasks like Matchstick Rotation; shows longer successful step tails.
- Gemini 2.5 Pro: Strong low-level perception—best on Jigsaw, Maze 2D, Zoom-In Puzzle, Sliding Block.
- Qwen3-VL: Strong at object localization—best on Referring Dot-Pointing.
- Steps taken: Many models succeed in 3–5 steps, then success rates drop, signaling weak long-horizon handling.
Surprising findings and controlled diagnoses
- History length (inverted-U):
- Keeping a small window (about 4 turns) often helps, but giving the full unbounded history commonly reduces success.
- Takeaway: Too much past distracts current reasoning; trim the memory backpack.
- Observation modality (image vs ASCII):
- For symbolic tasks (Matchstick Equation, Maze 2D, Patch Reassembly, Sliding Block), GPT-5 often improves 3–4× in ASCII.
- Exception: Figlet-style ASCII can distort shapes; for Matchstick Equation, some models did better with images.
- Takeaway: Perception is often the bottleneck; when perception is simplified to text, reasoning shines.
- Remove textual feedback:
- Performance drops across Maze 3D, Maze 2D, Sliding Block, Matchstick Equation.
- Takeaway: Today’s models rely on explicit textual confirmations/errors; inferring consequences from pixels alone is weak.
- Show the final goal at the start:
- Big gains overall (e.g., Jigsaw, Patch Reassembly, Colorization), but can backfire (Zoom-In Puzzle, Matchstick Equation) due to visual misjudgment—models think 'already identical' and stop.
- A follow-up check found Gemini 2.5 Pro incorrectly answered 'images are identical' 80% and 57% of the time for those two tasks, confirming the perception gap.
- Fine-tuning with solver demos (SFT):
- Fine-tuned models reached state-of-the-art on many tasks, proving learnability and the value of structured demos.
- Stronger base model generalizes better: Qwen3-VL, trained with the same demos as Qwen2.5-VL, nearly doubled average success on unseen hard variants despite similar easy performance.
- Vision vs LLM: Most tasks gained from tuning both; LLM (temporal reasoning) provided larger gains under partial observability or unknown dynamics; some perception-heavy tasks benefited more from vision.
- Information-revealing demos: In Matchstick Rotation, adding two unit-scale probes before the final move boosted success from 32.9% to 70.0%. In 3D Mental Rotation (Objaverse), full-axis rotations reduced angular error and raised success; further training on solve-only data hurt, proving the structure—not just length—matters.
Common failure patterns (from automated analysis) 🍞 Hook When stuck, many players repeat the same move, forget warnings, or quit too early.
🥬 The Concept: Four recurring failure modes describe how models go wrong.
- What it is: (1) Action looping, (2) State mismanagement, (3) Early termination, (4) Failure to use visual/spatial info.
- How it works:
- Cluster model traces using a VLM annotator.
- Count failures per model/task.
- Connect patterns to ablations (history, modality, feedback).
- Why it matters: Naming failures turns mystery into measurable targets for improvement. 🍞 Anchor In Maze 2D, models often keep moving into a wall after 'blocked' feedback (looping + mismanagement).
Overall meaning of the numbers
- VisGym exposes real, cross-domain weaknesses in long-horizon perception-action coupling.
- Knobs like history truncation, ASCII rendering, textual feedback, and goal visibility sharply change outcomes—so careful interface design is as important as raw model size.
- SFT with structured, information-revealing demos offers a reliable path to gains today, while highlighting the need for better visual grounding and memory strategies tomorrow.
05 Discussion & Limitations
Limitations
- Visual grounding gaps: Many tasks got much easier when rendered as ASCII, revealing that current perception encoders miss fine details or struggle with cluttered scenes.
- Overreliance on feedback: Removing text feedback consistently hurt, showing insufficient ability to infer consequences from images alone.
- Long-context fragility: Unbounded histories often reduce success; models struggle to filter stale or irrelevant past frames.
- Brittle goal matching: Providing the final goal can cause early stopping if tiny differences are misjudged as identical.
- Domain realism: While diverse, some tasks are still simplified simulations; bridging to messy real-world sensor noise remains ongoing.
Required resources
- Compute: Multi-turn vision + language inference is token- and pixel-heavy; long episodes can be costly, especially with large proprietary models.
- Data: SFT needs many clean, multi-step demos; information-revealing curation improves outcomes but adds design effort.
- Engineering: Integrating custom tasks requires action schemas, instructions, and solvers; VisGym streamlines this with a unified step function but still needs task-specific setup.
When NOT to use VisGym
- Pure single-shot VQA or captioning without interaction—simpler static benchmarks fit better.
- Non-visual tool-use agents (e.g., text-only code assistants)—VisGym’s visual interactivity isn’t needed.
- Reinforcement learning without function-call actions—if your agents require continuous low-level torques, you may need raw-control simulators.
Open questions
- Better long-context strategies: How to summarize visual histories so more is helpful, not harmful?
- Perception–action causality: Can models learn consequences directly from pixels without textual hints?
- Robust goal checking: How to avoid 'looks identical' mistakes—perceptual diff tools, uncertainty estimates, or learned verifiers?
- Curriculum and data curation: What is the optimal schedule of information-revealing probes for different kinds of hidden-state tasks?
- Transfer to the wild: How do VisGym-trained skills carry over to noisy real robots, GUIs, or egocentric videos without domain shortcuts?
06 Conclusion & Future Work
Three-sentence summary VisGym is a unified, controllable gym of 17 visual, interactive, long-horizon tasks that evaluates and trains vision–language agents with function-call actions, optional text feedback, and oracle-generated demonstrations. Systematic toggles—history length, observation modality, feedback, and goal visibility—reveal why models fail (looping, weak memory, perception gaps) and how to help them (truncate histories, add feedback, use goal images carefully, and curate information-revealing demos). Fine-tuning on solver trajectories yields strong gains, especially when demos expose hidden states or unknown dynamics, charting a clear path toward more capable multimodal agents.
Main achievement A single, extensible framework that makes multi-step visual decision-making measurable, diagnosable, and trainable across domains—turning vague 'model struggles' into concrete, fixable failure modes with proven remedies.
Future directions
- Build better visual grounding via stronger encoders and learned verifiers for goal equality.
- Develop memory mechanisms that keep helpful summaries but drop stale history.
- Expand information-revealing curricula and combine with online RL for closed-loop learning.
- Scale domains (harder puzzles, larger mazes, richer manipulation) and add real-world sensors.
Why remember this VisGym shows that careful interfaces and curated multi-step experiences matter as much as bigger models: with the right knobs and demos, we can systematically turn long-horizon visual action from 'mystery' into 'mastery'.
Practical Applications
- Design curriculum SFT: start with solver demos that include small probes (e.g., unit moves, full-axis rotations), then progress to solve-only data.
- Adopt function-call actions in new environments so language models can transfer tool-use skills across tasks.
- Use history truncation (e.g., last 2–4 turns) as a default to avoid long-context degradation in interactive agents.
- Keep textual feedback on during training and evaluation to stabilize learning; gradually wean off to improve pixel-based causality.
- Provide goal observations for perception-dominant tasks, but add robust visual-diff checks to avoid false 'identical' judgments.
- Switch to ASCII/text renderings for symbolic tasks during early training to bypass perception bottlenecks and teach planning first.
- Run failure-mode audits (looping, mismanagement, early stopping, ignoring visuals) to prioritize fixes and measure progress.
- Fine-tune both the LLM (for temporal reasoning) and the vision encoder (for fine details), prioritizing LLM under partial observability.
- Benchmark new models across all 17 VisGym tasks to find domain strengths/weaknesses before deployment.
- Automate demo generation with oracle solvers to scale high-quality, diverse training data without manual labels.