Language-based Trial and Error Falls Behind in the Era of Experience
Key Summary
- Big language models are great at words but waste lots of time and energy when they try random actions in non-language games like Sudoku, Sokoban, 2048, FrozenLake, and Rubik's Cube.
- This paper's idea is simple: let tiny, fast 'scout' models do the messy exploring first, then teach the big model what they learned.
- SCOUT has three stages: scouts explore, their experiences get turned into text to teach the LLM (SFT), and then the LLM polishes its skills with multi-turn RL.
- With SCOUT, a small open model (Qwen2.5-3B-Instruct) scored 0.86 on average, beating strong proprietary models like Gemini-2.5-Pro (0.60).
- Training cost drops by about 60% in GPU hours because exploration is offloaded to tiny, CPU-friendly scouts.
- The method works across many unseen symbolic and spatial tasks, including very long ones like 2048 and 3D ones like Rubik's Cube.
- Sometimes SFT alone gives big jumps (e.g., Rubik's Cube), and sometimes RL 'activates' hidden skills after SFT (e.g., Sudoku from 0.29 to 0.97).
- In multi-task training, SCOUT keeps old skills while adding new ones, avoiding catastrophic forgetting and reaching an average of about 0.91.
- The key insight is decoupling exploration (cheap, fast, small models) from exploitation (smart reasoning by the LLM) so each part does what it's best at.
Why This Research Matters
Many useful problems aren’t sentences—they’re buttons, grids, rules, and maps. SCOUT makes AI better at these by letting tiny models do the heavy exploration and big models do the smart planning. This cuts training cost by about 60% while boosting accuracy, so more people and labs can build capable agents. It also helps agents keep old skills while learning new ones, which is vital for assistants that must grow over time. Better symbolic and spatial reasoning means stronger tools for spreadsheets, robotics, logistics, and education. In short, SCOUT turns expensive guesswork into efficient learning, bringing powerful problem-solving closer to everyday use.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re learning a new board game with weird rules written in symbols, not words. If you guess moves at random, you’ll use up time and energy and still play badly.
🥬 The Concept: Large Language Models (LLMs) are amazing at language tasks but struggle with unfamiliar, non-language worlds like grid puzzles and movement games, because they try to explore by talking—very slowly and expensively. How it works (before this paper):
- LLMs are pretrained on tons of text, so they’re strong at chat, stories, and many reasoning tasks tied to words.
- When dropped into symbolic or spatial tasks (e.g., Sokoban, Sudoku, Rubik’s Cube), they must explore by generating text tokens to choose actions—one slow, costly step at a time.
- This creates a mismatch: the model searches a huge vocabulary space while the game only needs a few simple actions (like Up, Down, Left, Right). Why it matters: Without a better plan, LLM agents waste compute doing trial-and-error, learn slowly, and often never master the game’s rules.
🍞 Anchor: Think of using a skyscraper-sized computer to play tic-tac-toe by writing essays about each move. It works, but it’s silly and slow.
🍞 Hook: You know how in science class we try small experiments first to learn the rules before doing a big project?
🥬 The Concept: Reinforcement Learning (RL) is a way for agents to learn by trying actions and getting rewards or penalties. How it works:
- The agent sees a state (like a grid).
- It picks an action (like move right).
- It gets a reward (good/bad) and a new state.
- Over many tries, it learns what actions lead to success. Why it matters: RL can learn the hidden rules of an environment without needing them explained in words.
🍞 Anchor: Like a puppy learning tricks with treats—more treats for good tricks, fewer for bad ones.
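To make that loop concrete, here is a minimal tabular Q-learning sketch on Gymnasium's FrozenLake. It only illustrates the state → action → reward cycle described above; the paper's scouts use small neural networks (DQN/PPO), and the exact environments, wrappers, and reward shaping may differ.

```python
# Minimal sketch of the RL loop on FrozenLake (Gymnasium API).
# Illustrative only: the paper's scouts are neural networks, not a Q-table.
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=True)   # stochastic "slippery" dynamics
n_states, n_actions = env.observation_space.n, env.action_space.n
q_table = np.zeros((n_states, n_actions))           # value estimates for each (state, action)

alpha, gamma, epsilon = 0.1, 0.99, 0.1
for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Explore occasionally, otherwise exploit the current value estimates.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Nudge the value toward reward + discounted best future value.
        target = reward + gamma * np.max(q_table[next_state]) * (not terminated)
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state
```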
🍞 Hook: Imagine you’ve watched a skilled friend solve 100 puzzles. Now you can copy their steps to warm up before trying your own strategies.
🥬 The Concept: Supervised Fine-Tuning (SFT) teaches a model by showing correct examples and having it imitate them. How it works:
- Collect expert examples (state → action).
- Turn them into training pairs the model can read.
- Train the model to predict the expert’s action from the given state. Why it matters: SFT quickly gives the model solid basics, so it doesn’t start from zero.
🍞 Anchor: Like practicing piano by copying your teacher’s hand positions on beginner songs.
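As a sketch, one expert (state, action) pair can be packaged as a chat-style training example like the one below. The prompt wording and field names are illustrative assumptions, not the paper's exact template; a standard SFT trainer would then learn to reproduce the assistant turn.

```python
# Minimal sketch: wrap an expert (state, action) pair as a chat-format SFT example.
# Template wording and keys are illustrative, not the paper's exact format.
def to_sft_example(state_text: str, expert_action: str) -> dict:
    return {
        "messages": [
            {"role": "user", "content": f"Current state:\n{state_text}\nChoose an action."},
            {"role": "assistant", "content": expert_action},
        ]
    }

# One FrozenLake step recorded from an expert: the model is trained to predict "Down"
# (the assistant tokens), typically with the loss masked on the user turn.
example = to_sft_example("S F F F\nF H F H\nF F F H\nH F F G\nAgent at row 0, col 0", "Down")
```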
🍞 Hook: Think of a playground: if you understand which slides are slippery and which aren’t, you can plan smart moves.
🥬 The Concept: Environmental dynamics are the rules that say how actions change the world (like slipping on ice in FrozenLake). How it works:
- Observe how states change after each action.
- Notice patterns (push a box into a corner = stuck!).
- Use these patterns to plan future moves. Why it matters: Without knowing the dynamics, the agent can’t plan well.
🍞 Anchor: In Sokoban, learning not to push a box into a corner saves the game.
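As a tiny illustration of "knowing the dynamics," here is a hand-written check (not from the paper) for the Sokoban pattern mentioned above: a box wedged against two perpendicular walls, off any goal square, can never be moved again.

```python
# Illustrative only: detect the Sokoban "box in a corner" deadlock pattern.
# The coordinate scheme and wall/goal sets are assumptions about how the grid is encoded.
def is_corner_deadlock(walls: set[tuple[int, int]],
                       goals: set[tuple[int, int]],
                       box: tuple[int, int]) -> bool:
    """True if a box sits against two perpendicular walls and is not on a goal."""
    if box in goals:
        return False
    r, c = box
    blocked_vertical = (r - 1, c) in walls or (r + 1, c) in walls
    blocked_horizontal = (r, c - 1) in walls or (r, c + 1) in walls
    return blocked_vertical and blocked_horizontal

# Example: a box at (1, 1) with walls above and to its left is stuck.
walls = {(0, 1), (1, 0)}
print(is_corner_deadlock(walls, goals=set(), box=(1, 1)))  # True
```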
🍞 Hook: When you try a new restaurant, you face a choice—explore a new dish or exploit your favorite.
🥬 The Concept: Exploration-exploitation decoupling means letting different parts of the system specialize: one explores cheaply, another exploits wisely. How it works:
- A small, fast model explores the world to map the rules.
- It collects great examples.
- The big language model learns from those, then focuses on smart planning. Why it matters: This avoids wasting the big model’s power on random guessing.
🍞 Anchor: Send scouts ahead on the trail to find safe paths; then the main group follows the best route.
🍞 Hook: Teams use specialists: sprinters to scout the field, strategists to plan the game.
🥬 The Concept: Scout networks are tiny neural nets (like small MLPs/CNNs) trained with RL to quickly explore and master task rules. How it works:
- Train scouts with RL (e.g., DQN, PPO) on raw symbolic states.
- Generate many expert trajectories (state, action, reward sequences).
- Convert them into text the LLM can read and learn from. Why it matters: Scouts run fast on CPUs, saving time and GPU money while finding what works.
🍞 Anchor: Like sending speedy bikers to map a race course before the coach trains the whole team.
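Below is a minimal PyTorch sketch of what such a scout might look like: a small MLP that maps a flattened symbolic state to Q-values over a handful of discrete actions. The sizes, architecture, and state encoding are illustrative assumptions; the paper's scout networks and training setups may differ.

```python
# A tiny scout network sketch: flattened symbolic state in, Q-values over a small action set out.
# Sizes and encoding are illustrative; the paper's exact scout architectures may differ.
import torch
import torch.nn as nn

class ScoutMLP(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q-value per action (DQN-style head)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action selection once the scout is trained: cheap enough to run on a CPU.
scout = ScoutMLP(state_dim=16, n_actions=4)      # e.g., a 4x4 grid as a one-hot state
state = torch.zeros(1, 16); state[0, 0] = 1.0    # agent at the start cell
action = int(scout(state).argmax(dim=-1))        # index into the task's small action set
```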
The world before: Many agent papers tried to push LLMs harder with better prompts, memory, or longer context. They helped in language-heavy tasks but often failed in symbolic/spatial ones: the model didn't know the environment's physics.
The problem: Massive exploration with a token-generating LLM is slow and pricey. Plus, the action space is tiny while the model's output space is huge.
Failed attempts: Direct multi-turn RL on LLMs (no warm-up) often burns compute, learns unstable policies, and still underperforms.
The gap: We needed a way to learn task rules fast and cheaply, and then transfer them to the LLM.
Real stakes: This touches daily life: robotics, logistics, spreadsheets, UI automation, and educational games all have symbolic or spatial rules. Making agents good at these without huge compute bills means smarter assistants for everyone.
02 Core Idea
🍞 Hook: You know how a school play works best when the stage crew sets everything up before the actors arrive? The actors then shine because the hard prep is done.
🥬 The Concept: The key insight is to split learning into two jobs—scouts do fast, cheap exploration; the LLM learns from those experiences and then fine-tunes itself with multi-turn RL to make great decisions. How it works:
- Train tiny scout models with RL to master the task’s rules and collect expert trajectories.
- Convert those trajectories into dialogue-style text (textualizer) and do Supervised Fine-Tuning (SFT) to "warm up" the LLM.
- Run multi-turn RL on the LLM to refine and activate deeper planning and reasoning. Why it matters: The LLM avoids expensive random trial-and-error and focuses on what it’s best at: reasoning and generalization.
🍞 Anchor: Like letting assistants set up a science fair project so the presenter can focus on the explanation and win the ribbon.
Multiple analogies:
- Sports team: Scouts watch opponents and gather plays; the coach (LLM) studies the footage (SFT) and then practices drills (RL) to win.
- Cooking: A sous-chef (scout) preps ingredients and tests flavors; the head chef (LLM) assembles the final dish and adjusts seasoning (RL).
- Space mission: Drones (scouts) map terrain; mission control (LLM) plans the route and adapts in real time.
Before vs After:
- Before: LLM explores by generating many tokens, slow and costly; performance modest on non-linguistic tasks.
- After: Scouts explore quickly and cheaply; the LLM starts from strong examples, then polishes with RL, reaching top scores at far lower cost.
Why it works (intuition, no equations):
- Sample efficiency: Small nets interact thousands of times per second and quickly learn the transition rules.
- Representation match: Scouts operate in the task’s tiny action space; the LLM learns from their distilled experiences instead of searching its huge vocabulary for valid moves.
- Bootstrapping: SFT moves the LLM from clueless to competent; RL then improves strategy and long-horizon planning.
- KL-guarded multi-turn PPO: Keeps learning stable while optimizing whole trajectories, not just single turns.
Building blocks (with mini-sandwiches):
- 🍞 Hook: When you enter a maze, it's faster to send a quick runner first. 🥬 The Concept: Exploration Stage trains scouts (tiny MLP/CNN) with RL (DQN/PPO) on raw states to learn the environment. How it works: (1) Interact a lot. (2) Learn what actions lead to rewards. (3) Save expert rollouts. Why it matters: Fast coverage of the state space without burning GPU on LLM tokens. 🍞 Anchor: The runner returns with the best path sketched out.
- 🍞 Hook: Notes from a friend are easier to study when they're in your language. 🥬 The Concept: Distillation Stage turns scout trajectories into text (via a textualizer) and fine-tunes the LLM with SFT. How it works: (1) Convert (state, action, reward) into dialogue turns. (2) Train the LLM to imitate the expert actions. Why it matters: Gives the LLM the task's physics in a format it understands. 🍞 Anchor: Like class notes rewritten in your handwriting so you remember them better.
- 🍞 Hook: After basics, you practice drills to get championship-ready. 🥬 The Concept: Evolving Stage uses multi-turn RL (trajectory-level PPO) so the LLM plans and adapts. How it works: (1) Let the LLM act over multiple steps. (2) Reward full-episode success. (3) Keep it close to its good habits (KL regularization). Why it matters: Moves from copying to outperforming, activating strategic reasoning. 🍞 Anchor: Scrimmage games that sharpen plays until the team wins consistently.
- 🍞 Hook: Translators help two people speak smoothly. 🥬 The Concept: State-Text Mapping (Textualizer) converts symbolic states into clean, deterministic dialogue without hand-coding rules. How it works: (1) Read gym environment states. (2) Fill a standard template. (3) Produce user/assistant turns. Why it matters: No fragile prompt engineering; consistent, automatable data (a small code sketch follows this list). 🍞 Anchor: Turning a Sokoban grid into a readable map with clear actions.
- 🍞 Hook: A recipe is easier to learn if you watch a chef make it step-by-step. 🥬 The Concept: Expert Trajectories are high-quality recorded sequences scouts followed to succeed. How it works: (1) Train scout. (2) Freeze best policy. (3) Roll out many wins and near-wins. (4) Save sequences. Why it matters: They are gold-standard examples for the LLM to learn from. 🍞 Anchor: A time-lapse video showing exactly how to solve a Rubik's Cube scramble.
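A minimal textualizer sketch is shown below: it deterministically turns a scout trajectory of (state, action, reward) steps into user/assistant turns, with each new user turn reporting the reward from the previous action. The template wording and the ASCII Sokoban encoding are illustrative, not the paper's exact format.

```python
# Minimal textualizer sketch: convert a scout trajectory into a multi-turn chat.
# Template wording and state encoding are illustrative, not the paper's exact format.
def textualize(trajectory: list[tuple[str, str, float]]) -> list[dict]:
    """Each step is (state_text, action_taken, reward_received_after_that_action)."""
    messages, last_reward = [], None
    for step, (state_text, action, reward) in enumerate(trajectory):
        feedback = "" if last_reward is None else f"Reward from last action: {last_reward}\n"
        messages.append({
            "role": "user",
            "content": f"{feedback}Step {step}. Current state:\n{state_text}\nChoose an action.",
        })
        messages.append({"role": "assistant", "content": action})
        last_reward = reward
    return messages

# Example: two Sokoban steps ('@' player, '$' box, '.' goal, '*' box on goal)
# recorded by a scout become four chat turns ready for SFT.
traj = [("#####\n#@$.#\n#####", "Right", 0.0),
        ("#####\n# @*#\n#####", "Right", 1.0)]
dialogue = textualize(traj)
```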
03 Methodology
At a high level: Environment state → Exploration Stage (scouts learn and collect expert trajectories) → Distillation Stage (textualize + SFT the LLM) → Evolving Stage (multi-turn PPO on the LLM) → Skilled agent.
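Read as code, that flow has roughly the shape sketched below. The stage functions are placeholders standing in for the paper's components (scout training, rollout collection, textualization, SFT, multi-turn PPO), so treat this as a shape of the pipeline, not an implementation.

```python
# High-level shape of the SCOUT pipeline. Each stage function is a placeholder
# you would supply; names and signatures are illustrative, not the paper's API.
def scout_pipeline(env, llm, *, train_scout, collect_rollouts, textualize,
                   supervised_finetune, multi_turn_ppo):
    # 1) Exploration Stage: a cheap scout learns the environment with standard RL.
    scout = train_scout(env)                     # e.g., DQN or PPO on raw symbolic states
    trajectories = collect_rollouts(scout, env)  # expert (state, action, reward) sequences

    # 2) Distillation Stage: textualize trajectories, then warm up the LLM with SFT.
    sft_data = [textualize(traj) for traj in trajectories]
    warm_llm = supervised_finetune(llm, sft_data)

    # 3) Evolving Stage: trajectory-level, KL-regularized multi-turn RL on the LLM.
    return multi_turn_ppo(warm_llm, env, kl_reference=warm_llm)
```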
Step-by-step details:
- Exploration Stage (scouts with RL)
  - What happens: Tiny MLPs/CNNs (no language) interact directly with symbolic states in tasks like FrozenLake (static/slippery), Sokoban (Box1/Box2), 2048, Sudoku (4×4), and 2×2 Rubik's Cube. They are trained with standard RL (DQN for value-based control; PPO for policy-based control) to maximize rewards.
  - Why this step exists: It's vastly cheaper and faster to let small nets map the environment's rules. LLM exploration is slow due to huge vocabularies and token-by-token action output.
  - Concrete example: In 2048, a scout rapidly tries thousands of Up/Right/Down/Left sequences, learning that keeping rows aligned and saving space improves long-term merges.
  - Secret sauce: High-throughput interaction. Scouts can run on CPUs with tiny memory and find good policies quickly.
- Distillation Stage (textualizer → SFT)
  - What happens: Convert each scout trajectory (state, action, reward, next state) into a clean multi-turn conversation the LLM can read. Then do Supervised Fine-Tuning so the LLM learns to produce the right actions from the text state.
  - Why this step exists: The LLM needs language-formatted examples to absorb the environment's physics. SFT gives it a strong starting policy instead of random guessing.
  - Concrete example: A Sokoban rollout becomes: User shows grid + step count → Assistant outputs an action (e.g., Right) → User returns reward and updated grid, repeating until success.
  - Secret sauce: Deterministic mapping (no hand-written tricks). The a_think field (inner thoughts) is left blank here to keep data compact and on-task.
- Evolving Stage (multi-turn PPO on the LLM)
  - What happens: Now the warmed-up LLM plays full episodes. We optimize entire trajectories, not just single messages. The reward comes from task success; a KL term keeps learning stable by not drifting too far from the SFT policy. (A small rollout sketch follows this list.)
  - Why this step exists: SFT copies competence but may cap out at the scout's level. Multi-turn RL unlocks strategy, planning, and generalization so the LLM can surpass its teacher.
  - Concrete example: In Sudoku, the SFT model follows rules but misses deep deductions. After multi-turn PPO, it plans ahead, notes box/row/column constraints in its think steps, and consistently finds correct placements.
  - Secret sauce: Trajectory-level optimization that respects temporal credit assignment: actions now affect much later outcomes.
- Putting it together (flow with example data):
  - Input: A Rubik's Cube scramble (Rotation3) in symbolic form.
  - Exploration: Scouts try sequences like U, R, F', …, and learn which patterns reduce disorder.
  - Distillation: Convert the best scout rollouts to dialogue (state description → action → reward → next state).
  - SFT: Train the LLM to imitate these moves when it reads the same descriptions.
  - Evolving: Run multi-turn PPO to refine strategies; the LLM begins producing thoughtful plans in <think> and matching answers in <answer>.
  - Output: The LLM solves most scrambles quickly and reliably.
- Extra clever bits:
  - Action-space alignment: Scouts learn in a tiny discrete action set; the LLM, after SFT, outputs those same actions, avoiding token-search chaos.
  - Context efficiency: Starting from concise expert paths shortens episodes and reduces token counts, accelerating RL.
  - Flexibility: Works across tasks with very different dynamics (stochastic slipping in FrozenLake vs. combinatorial logic in Sudoku vs. long-horizon 2048).
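As promised above, here is a sketch of collecting one multi-turn episode for the Evolving Stage. The <think>/<answer> tags follow the format described in this section; `generate_fn` and `textualize_state` are placeholders for the SFT-warmed LLM's generation call and the textualizer, and the FrozenLake action mapping is just an example. A PPO trainer would then optimize whole trajectories like this one, with a KL penalty toward the SFT policy.

```python
# Sketch: roll out one multi-turn episode with the (SFT-warmed) LLM acting as the policy.
# `generate_fn(messages) -> str` and `textualize_state(state) -> str` are placeholders.
import re

ACTIONS = {"Left": 0, "Down": 1, "Right": 2, "Up": 3}   # Gymnasium FrozenLake mapping (example)

def rollout_episode(env, textualize_state, generate_fn, max_turns: int = 30):
    messages, turns = [], []
    state, _ = env.reset()
    total_reward = 0.0
    for _ in range(max_turns):
        messages.append({"role": "user", "content": textualize_state(state)})
        reply = generate_fn(messages)                    # e.g. "<think>...</think><answer>Right</answer>"
        messages.append({"role": "assistant", "content": reply})
        match = re.search(r"<answer>(.*?)</answer>", reply, re.S)
        name = match.group(1).strip() if match else ""
        action = ACTIONS.get(name, env.action_space.sample())   # fall back if the tag is malformed
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        turns.append((reply, reward))                    # per-turn record for the PPO trainer
        if terminated or truncated:
            break
    # Trajectory-level PPO credits every assistant turn with the episode outcome
    # (plus a KL term toward the SFT policy), rather than scoring turns in isolation.
    return turns, total_reward
```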
Mini-sandwich spotlights for core algorithms:
- 🍞 Hook: Choosing between two moves is like picking the better of two paths. 🥬 The Concept: DQN estimates which action leads to higher long-term reward and picks the best. How it works: (1) Learn action values from experience. (2) Use a target network for stable learning. (3) Choose the action with the highest value. Why it matters: Very sample-efficient for discrete actions. 🍞 Anchor: In FrozenLake, DQN prefers safer moves that avoid holes while still reaching the goal.
- 🍞 Hook: Practicing a routine carefully, without changing too much at once. 🥬 The Concept: PPO improves a policy while clipping big jumps so learning stays stable. How it works: (1) Compare new vs. old action probabilities. (2) Clip the ratio to avoid wild updates. (3) Repeat small safe steps. Why it matters: Smooth learning that avoids collapsing. 🍞 Anchor: In Sokoban, PPO gradually favors box pushes that don't create deadlocks.
- 🍞 Hook: Writing a plan before acting helps you do better. 🥬 The Concept: Multi-turn PPO on the LLM optimizes whole conversations (state → think → answer → next state …). How it works: (1) Reward full-episode success. (2) Penalize drifting too far from a good base policy. (3) Encourage useful thoughts that lead to better final answers. Why it matters: Planning across steps beats single-shot guessing. 🍞 Anchor: In Sudoku, the LLM notes where a '2' must go and then commits to it correctly.
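For readers who want the math, the textbook forms of these three objectives are sketched below. These are standard formulations; the paper's exact losses, coefficients, and KL estimator may differ.

```latex
% Standard (textbook) forms of the objectives named above; the paper's exact
% losses and coefficients may differ.

% DQN: temporal-difference target and loss for a transition (s, a, r, s'),
% using a frozen target network Q_{\theta^-}.
\[
  y = r + \gamma \max_{a'} Q_{\theta^-}(s', a'),
  \qquad
  \mathcal{L}_{\mathrm{DQN}}(\theta) = \bigl(y - Q_{\theta}(s, a)\bigr)^{2}
\]

% PPO: clipped surrogate with ratio r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)
% and advantage estimate \hat{A}_t.
\[
  \mathcal{L}_{\mathrm{PPO}}(\theta) =
  \mathbb{E}_{t}\!\left[
    \min\Bigl(r_t(\theta)\,\hat{A}_t,\;
              \mathrm{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}_t\Bigr)
  \right]
\]

% Multi-turn (trajectory-level) objective: maximize episode return R(\tau) while
% staying close to the SFT policy \pi_{\mathrm{SFT}} via a KL penalty with weight \beta.
\[
  J(\theta) =
  \mathbb{E}_{\tau \sim \pi_{\theta}}\bigl[R(\tau)\bigr]
  \;-\; \beta\,
  \mathbb{E}\bigl[\mathrm{KL}\bigl(\pi_{\theta} \,\Vert\, \pi_{\mathrm{SFT}}\bigr)\bigr]
\]
```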
04 Experiments & Results
The tests: The authors evaluated across six unseen, symbolic/spatial tasks with varying difficulty: Bandit, FrozenLake (Static/Slippery), Sokoban (1/2 boxes), Sudoku (4×4), 2048 (very long horizon), and 2×2 Rubik’s Cube (Rotation1/2/3). They measured pass@1 (success rate) and normalized 2048 returns, and also analyzed training cost (GPU hours) and multi-task learning stability.
The competition: SCOUT (with Qwen2.5 Instruct models at 0.5B, 1.5B, 3B and LLaMA3.1-1B) was compared to:
- Open baselines: RAGEN, State Estimation RL, SPA, and direct Multi-turn PPO.
- Proprietary or large open models: GPT-4o-mini, DeepSeek-V3, GPT-OSS-120B, GPT-5-nano, Gemini-2.5-Pro.
- Also, the scouts themselves (DQN, PPO) served as reference teachers.
The scoreboard with context:
- Headline: With SCOUT, Qwen2.5-3B-Instruct reached an average score of 0.86—like getting an A—while strong proprietary Gemini-2.5-Pro scored 0.60 (a solid B- to C+), and other baselines lagged further.
- Cost savings: On tough Rubik’s Cube (Rotation3), direct PPO needed 24.0 GPU-hours; SCOUT needed only 9.6 GPU-hours—about 60% less—by offloading exploration to CPU-friendly scouts and starting RL from concise expert paths.
- SFT vs RL gains: SFT alone often gave big jumps (e.g., near-90% on Rubik’s Cube), but multi-turn PPO then “activated” deeper strategy, especially on Sudoku (from 0.29 after SFT to 0.97 after RL).
- Scout quality: Scout-DQN generally outperformed Scout-PPO on several discrete-action tasks, likely due to off-policy sample efficiency. Still, the LLM + SCOUT pipeline ultimately surpassed both.
Surprising findings:
- Surpass the teacher: After SFT + multi-turn RL, the LLM sometimes outperformed the scout that taught it. That’s like a student beating the coach in a friendly match.
- Task-dependent activation: Some tasks (Rubik’s Cube) benefited hugely from SFT alone; others (Sudoku) needed RL to unlock latent rules learned during SFT.
- Multi-task stability: In sequential RL across five tasks, SCOUT avoided catastrophic forgetting and ended with ~0.91 average, maintaining earlier skills while mastering new ones. Plain sequential RL, without SCOUT warm-up, stagnated around 0.37.
Concrete numbers (highlights):
- Qwen2.5-3B-Instruct + SCOUT: average 0.86, beating Gemini-2.5-Pro at 0.60.
- SFT checkpoints often exceeded multi-turn PPO baselines trained from scratch, proving the value of scout-generated trajectories.
- Scalability: Bigger backbones (0.5B → 3B) improved averages from ~0.81 to ~0.86 under SCOUT, showing consistent gains.
Meaning of these scores: On puzzles where most models stumble or wander, SCOUT-trained LLMs behave like seasoned players—planning ahead, avoiding traps, and finishing more puzzles correctly in fewer tries, all while spending far less compute during learning.
Takeaway: Decoupling exploration (small, fast scouts) from exploitation (LLM reasoning) changes the game. You get higher scores, steadier learning, and much lower cost.
05 Discussion & Limitations
Limitations:
- Scope of backbones: Results are shown mainly up to 3B parameter models (Qwen2.5 series and LLaMA3.1-1B). Larger models or varied architectures might behave differently or even better.
- RL stability: As in much of RL, some runs showed performance dips after more training. Stabilizing multi-turn RL for LLMs remains an open challenge.
- Textualizer assumptions: While deterministic and template-based, it assumes clean access to environment interfaces. Messier real-world settings may need more robust serialization.
- Domain shift: Though SCOUT handles many symbolic/spatial tasks, extremely complex 3D physics or rich multimodal worlds may require additional components (e.g., vision encoders).
Required resources:
- For scouts: Modest CPUs and <1 GB RAM. Fast and cheap.
- For LLM SFT and multi-turn PPO: Access to GPUs (e.g., 8× H100 in experiments). Training is significantly cheaper than direct PPO-from-scratch but still non-trivial.
When not to use SCOUT:
- If your task is already language-native (e.g., dialogue, summarization), direct prompting/fine-tuning may suffice.
- If the environment is tiny or trivial, scout exploration adds overhead without much benefit.
- If you cannot obtain reliable state interfaces to build the textualizer, distillation may be difficult.
Open questions:
- Can we automate or learn the textualizer end-to-end for messy, real-world inputs?
- How far can SCOUT scale to larger, multimodal agents (vision, audio) in robotics and UI automation?
- Can we further stabilize multi-turn RL (e.g., better objectives, curricula, or safety constraints)?
- What’s the best way to share scout knowledge across related tasks to speed up transfer even more?
- Can we compress the LLM after training to keep runtime costs as low as the scouts while preserving skill?
06 Conclusion & Future Work
Three-sentence summary:
- SCOUT splits exploration and exploitation: tiny scouts learn the rules of new symbolic/spatial worlds quickly and cheaply, then teach the LLM through SFT.
- The LLM refines its skills with multi-turn RL, activating deeper planning and often surpassing the scout.
- This approach reaches top scores on tough unseen tasks while cutting training GPU hours by about 60%.
Main achievement:
- A simple, powerful recipe—scout exploration + textualized distillation + trajectory-level RL—that lets a small open LLM (3B) beat strong proprietary systems on a broad suite of unseen tasks.
Future directions:
- Extend SCOUT to richer multimodal settings (vision, speech), automate textualization for messy inputs, and improve multi-turn RL stability.
- Explore knowledge sharing across task families and compress trained LLMs for cheaper deployment.
Why remember this:
- SCOUT shows that experience beats guesswork: let small models explore, let big models reason. This division of labor turns LLM agents from clumsy explorers into confident problem-solvers—faster, cheaper, and better.
Practical Applications
- Automate spreadsheet workflows that require multi-step logic (sorting, merging, deduping) with fewer errors.
- Solve warehouse path-planning and box-stacking tasks (Sokoban-like) more efficiently.
- Train robotics simulators to learn environment physics cheaply before deploying to real robots.
- Build puzzle tutors (Sudoku, 2048) that teach strategy after learning from fast scouts.
- Create UI agents that learn app navigation flows via scouts, then generalize with an LLM.
- Speed up game AI development by using scouts to discover strategies and LLMs to explain them.
- Develop multi-task assistants that add new skills without forgetting old ones.
- Prototype safe exploration in risky domains by sandboxing scouts first, then distilling to LLMs.
- Accelerate scientific simulation control (e.g., parameter sweeps) by offloading exploration to scouts.
- Reduce cloud costs for RL fine-tuning of LLM agents by avoiding token-heavy exploration.