
Meta-RL Induces Exploration in Language Agents

Intermediate
Yulun Jiang, Liangze Jiang, Damien Teney et al. Ā· 12/18/2025
arXiv Ā· PDF

Key Summary

  • This paper introduces LAMER, a Meta-RL training framework that teaches language agents to explore first and then use what they learned to solve tasks faster.
  • Instead of treating each attempt as separate, LAMER links multiple attempts (episodes) of the same task into a single trial and spreads credit across them.
  • A key idea is in-context policy adaptation via self-reflection: after each attempt, the agent writes a short lesson to guide the next attempt—no weight updates needed.
  • LAMER improves success rates over strong RL baselines by roughly 11 percentage points on Sokoban, 19 on Minesweeper, and 14 on Webshop at pass@3.
  • The method balances exploration and exploitation using a cross-episode discount factor, which nudges the agent to gather information early and capitalize later.
  • LAMER preserves more trajectory diversity than standard RL, which makes its exploration more robust.
  • The trained agents generalize better to harder settings and unseen task types (notably in ALFWorld).
  • Although training can be slower due to sequential episodes per trial, the data budget is matched to RL and can be sped up with smarter rollout strategies.
  • Meta-RL gives a principled way to induce exploration in language agents and turns test-time compute into quick, on-the-fly learning.

Why This Research Matters

Real-world tasks are messy and often unclear on the first try. LAMER equips language agents with a principled way to explore, learn from their own mistakes, and do better just a moment later—without retraining. That means smarter shopping assistants that quickly zero in on the right item, tutoring bots that adapt to a student’s misunderstandings, and household agents that stay calm and curious when rooms look different. By turning test-time attempts into a mini learning process, LAMER makes agents robust in new situations. This approach can reduce frustration, save time, and make AI helpers feel more thoughtful and reliable in everyday life.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: You know how when you try a puzzle game, your first try is mostly about figuring out the rules, and your second try is where you crush it because now you know what to avoid? That first attempt teaches you where to explore, and the next attempt is where you use that knowledge to win.

🄬 The Concept: Trial-and-Error Learning

  • What it is: Learning by trying things, noticing what happens, and adjusting next time.
  • How it works:
    1. Make an attempt.
    2. See the feedback (win/lose, clues, points).
    3. Change your plan based on that feedback.
    4. Try again with a better strategy.
  • Why it matters: Without trial-and-error, you repeat the same mistakes and never improve. šŸž Anchor: In Minesweeper, if you click a bomb, you lose—but now you know where not to click next time.

šŸž Hook: Imagine a really good assistant that can read instructions, remember what happened, and type out step-by-step plans to reach a goal.

🄬 The Concept: Language Model (LLM) Agents

  • What it is: An AI that reads and writes text to act in a world, step by step.
  • How it works:
    1. The environment sends a text observation (like a game board or a website page).
    2. The agent replies with an action (like ā€œclick cell (3,2)ā€ or ā€œsearch[blue loafers]ā€).
    3. The environment updates and sends new text.
    4. Repeat until success or failure.
  • Why it matters: Without an agent that can read, reason, and act, you can’t solve multi-turn tasks like shopping or games. šŸž Anchor: In Webshop, the agent reads the product page and decides whether to click, search, or buy.
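To make the loop concrete, here is a minimal sketch of the observe-act cycle in Python. The environment's `reset()`/`step()` interface and the `llm_generate` call are hypothetical stand-ins for illustration, not the paper's actual code.

```python
# Minimal sketch of a text-based agent loop. `llm_generate` and the env's
# reset()/step() interface are hypothetical stand-ins, not the paper's API.

def run_episode(env, llm_generate, max_steps=30):
    """Run one episode: the LLM reads text observations and replies with text actions."""
    observation = env.reset()          # e.g., a rendered Minesweeper board or a Webshop page
    transcript, reward = [], 0.0
    for _ in range(max_steps):
        prompt = f"Observation:\n{observation}\nAction:"
        action = llm_generate(prompt)  # e.g., "click cell (3,2)" or "search[blue loafers]"
        observation, reward, done = env.step(action)
        transcript.append((prompt, action, reward))
        if done:
            break
    return transcript, reward          # the final reward signals success or failure
```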

šŸž Hook: Think of choosing between tasting new foods (explore) and ordering your favorite dish (exploit). If you only explore, you never enjoy a full meal; if you only exploit, you might miss discovering something better.

🄬 The Concept: Exploration vs. Exploitation

  • What it is: Balancing trying new things (exploring) and using what you know works (exploiting).
  • How it works:
    1. At first, try a variety of actions to gather clues.
    2. Use those clues to pick smarter actions next time.
    3. Keep adjusting the balance based on remaining uncertainty.
  • Why it matters: Without exploration, agents get stuck with mediocre habits; without exploitation, they never finish the task. šŸž Anchor: In Sokoban, pushing a box too soon can trap it; exploring safe moves first helps plan a path to win.
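As a generic illustration of the trade-off (a textbook epsilon-greedy rule, not LAMER's mechanism), the sketch below explores with a small probability and otherwise exploits the best-known action.

```python
import random

def epsilon_greedy_action(value_estimates, epsilon=0.2):
    """Explore with probability epsilon; otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.randrange(len(value_estimates))  # explore: try a random action
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])  # exploit
```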

The world before: LLM agents could follow instructions and even chain their thoughts. But in multi-turn, long-horizon tasks (like Minesweeper or navigating a website), they often failed to explore smartly. Standard RL trained them to do well on average but not to adapt quickly at test time from their own fresh mistakes. Many prior methods either used single attempts, relied on offline imitation of other explorers, or were great at one-turn problems like math—none truly taught agents to explore in interactive environments.

The problem: When the success signal is sparse (you only learn at the end whether you succeeded), standard RL often learns a fixed play style that avoids risk. That can look fine on easy cases but breaks on new or tricky ones. The result: agents don’t probe uncertainty early, don’t learn from their own recent experience mid-task, and don’t scale well when given multiple tries.

Failed attempts: Prompting tricks (like ReAct) helped agents explain and act, and Reflexion added post-episode notes—but without a training objective that spreads credit across multiple attempts, these reflections weren’t consistently optimized. Offline distillation learned from past explorers but didn’t teach agents to be explorers themselves in new worlds. Strong RL algorithms (PPO, GRPO, GiGPO) improved stability and reward-chasing but still tended to reduce diversity and exploration.

The gap: What was missing was a way to make ā€œexplore early, exploit laterā€ the thing being trained—not just a nice idea. We needed a framework that treats several attempts at the same task as a single learning process, where the first attempts are encouraged to gather information and the later ones use it, with credit assigned across all of them.

šŸž Hook: Imagine studying across several quizzes that all cover the same topic. If your teacher only grades the last quiz, you won’t spend time experimenting early. But if your teacher counts all quizzes—with a bonus for improving—you’ll try new strategies first and then apply what you learned.

🄬 The Concept: Meta-Reinforcement Learning (Meta-RL)

  • What it is: Training an agent across many tasks so it learns how to learn—fast—when it meets a new one.
  • How it works:
    1. Outer loop: Train on many similar tasks so the agent picks up good exploration habits.
    2. Inner loop: Within a task, the agent adapts using the experience from earlier attempts.
    3. The training objective rewards the whole adapt-then-succeed process.
  • Why it matters: Without Meta-RL, agents memorize one policy and don’t adapt quickly to new or harder versions of a task. šŸž Anchor: In Minesweeper, a Meta-RL-trained agent learns to test uncertain spots early and then finishes the board safely on the next try.
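Structurally, the outer and inner loops can be sketched as below. The helper names (`sample_task`, `run_episode`, `reflect`, `update_policy`) are placeholders for illustration, assuming the trial structure described in this article rather than the paper's exact implementation.

```python
# Sketch of the outer/inner loop structure (placeholder callables, not the paper's code).

def meta_train(policy, sample_task, run_episode, reflect, update_policy,
               num_iterations=1000, episodes_per_trial=3):
    for _ in range(num_iterations):                      # outer loop: many training tasks
        task = sample_task()
        memory = ""                                      # inner-loop state: reflections so far
        trial = []
        for _ in range(episodes_per_trial):              # inner loop: adapt within one task
            episode = run_episode(policy, task, memory)
            memory = reflect(policy, episode, memory)    # in-context adaptation, no weight update
            trial.append(episode)
        update_policy(policy, trial)                     # credit assigned across the whole trial
```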

Real stakes: Exploration is what lets assistants figure out confusing websites, unfamiliar game levels, or new device settings. It means better shopping helpers, smarter tutoring systems, and home robots that don’t freeze when something is different. LAMER’s goal is simple: teach agents to explore actively, reflect on what they learned, and adapt on the fly—even during testing—so they keep getting better within a few tries.

02 Core Idea

šŸž Hook: Picture a two-round scavenger hunt. In round one, you roam around and map the area. In round two, you zip straight to the treasure because you now know where not to waste time.

🄬 The Concept: LAMER’s Key Insight

  • What it is: Train language agents to treat multiple attempts at a task as one learning arc—explore in early attempts, then exploit in later attempts—using in-context self-reflection instead of changing model weights.
  • How it works:
    1. Group several episodes on the same task into a single ā€œtrial.ā€
    2. Encourage early episodes to gather information (explore), then later ones to use it (exploit).
    3. After each episode, the agent writes a reflection that becomes part of the next attempt’s context.
    4. Train the whole process with a cross-episode return so the agent is rewarded for smart exploration that pays off later.
  • Why it matters: Without linking episodes and rewarding explore-then-exploit behavior, agents won’t naturally learn to adapt mid-task. šŸž Anchor: On Webshop, the agent first tries broader searches to learn the catalog, then narrows in on the exact color/size/price item in the second or third attempt.

Multiple analogies:

  1. Chef analogy: First taste the dish and adjust seasoning (explore), then serve the perfected plate (exploit).
  2. Sports analogy: First scrimmage tests plays and opponent habits (explore), then the rematch uses those insights to score (exploit).
  3. Detective analogy: First gather clues and rule out suspects (explore), then make the arrest (exploit).

šŸž Hook: You know how teachers sometimes grade improvement over time, not just the final test? That grading changes how you study.

🄬 The Concept: Cross-Episode Training Framework

  • What it is: Treat several episodes on the same task as a single unit and spread credit across them.
  • How it works:
    1. Run Episode 1; get feedback.
    2. Write a short reflection about mistakes and next plan.
    3. Run Episode 2 using that reflection as context.
    4. Repeat for a few episodes; assign rewards across all of them, not just the last.
  • Why it matters: Without cross-episode credit, agents won’t invest in exploration that only pays off later. šŸž Anchor: In Sokoban, a risky early probe move is rewarded if it helps complete the puzzle a few moves later.

šŸž Hook: Imagine taping a sticky note to your desk after each try: ā€œDon’t push the left box; it gets stuck.ā€ Next time you see the puzzle, that note guides you.

🄬 The Concept: In-Context Policy Adaptation via Self-Reflection

  • What it is: The agent updates its behavior by writing and reading its own reflections between episodes—no parameter updates needed.
  • How it works:
    1. After an episode, the agent summarizes what didn’t work and a new plan.
    2. That summary is added to the next prompt as memory.
    3. The agent follows the improved plan in the next attempt.
  • Why it matters: Without in-context adaptation, the model repeats mistakes and can’t quickly pivot during testing. šŸž Anchor: In Minesweeper, a reflection like ā€œCells around a ā€˜2’ need exactly two mines—avoid random clicks hereā€ steers the next move.
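A minimal sketch of what ā€œreflection as memoryā€ might look like in code, assuming a hypothetical `llm_generate` function; the prompt wording is illustrative, not the paper's template.

```python
def write_reflection(llm_generate, episode_transcript, succeeded):
    """Ask the model to compress the last attempt into a short, actionable lesson."""
    outcome = "succeeded" if succeeded else "failed"
    prompt = (
        f"You {outcome} on the last attempt. Transcript:\n{episode_transcript}\n"
        "In two or three sentences, state what went wrong (or right) and your plan for the next attempt."
    )
    return llm_generate(prompt)

def build_next_prompt(task_description, reflection, observation):
    """Fold the previous reflection into the next episode's prompt as memory."""
    memory_block = f"Lessons from your previous attempt:\n{reflection}\n" if reflection else ""
    return (
        f"Task: {task_description}\n"
        f"{memory_block}"
        f"Current observation:\n{observation}\n"
        "Next action:"
    )
```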

Before vs. After:

  • Before: RL agents mostly chased immediate rewards within one episode and became more deterministic, losing useful diversity. Test-time multiple attempts didn’t help much.
  • After: With LAMER, agents preserve helpful exploration, learn from their own feedback, and show big gains from pass@1 to pass@3.

šŸž Hook: Think of a thermostat that learns. If a room is too cold now but warms up in 10 minutes, the thermostat shouldn’t overreact—it needs to think across time.

🄬 The Concept: Why It Works (intuition)

  • What it is: LAMER changes the goal so early information-gathering is valuable only if it helps later success.
  • How it works:
    1. Reward flows across episodes, so smart early probing gets credit later when it leads to wins.
    2. Reflections compress what was learned into short, actionable advice.
    3. The agent learns a general strategy that transfers to new, harder, and even unseen tasks.
  • Why it matters: Without long-horizon credit and trained reflections, exploration stays random and short-sighted. šŸž Anchor: On Webshop, broad search terms in attempt 1 are rewarded because they let the agent precisely filter and buy the right item in attempt 2.

Building blocks:

  • Cross-episode return: reward is added up across attempts to value exploration.
  • Reflection memory: short, focused notes that guide the next attempt.
  • In-context adaptation: behavior changes using context alone, no gradients at test time.
  • Policy gradient training: standard RL optimization aligns the agent with this explore-then-exploit objective.

03 Methodology

High-level pipeline: Input (a task instance) → Episode 1 (act, get feedback) → Self-reflection (write a brief lesson) → Episode 2 (act, guided by reflection) → ... up to N episodes → Output (success/failure and total cross-episode return).
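Putting the pipeline together, a single trial might look like the sketch below. `agent_act` and `agent_reflect` are placeholder callables that wrap the LLM, and the environment is assumed to follow a conventional `reset()`/`step()` interface; this is an illustration, not the paper's implementation.

```python
# Sketch of one LAMER-style trial: N episodes on the same task, linked by reflections.

def run_trial(env, agent_act, agent_reflect, num_episodes=3):
    reflections = []                    # the agent's own notes, carried across episodes
    episode_rewards = []
    for episode_idx in range(num_episodes):
        obs, done, total_reward = env.reset(), False, 0.0
        while not done:
            action = agent_act(obs, reflections)       # reflections enter via the prompt/context
            obs, reward, done = env.step(action)
            total_reward += reward
        episode_rewards.append(total_reward)
        if episode_idx < num_episodes - 1:
            reflections.append(agent_reflect(obs, total_reward))  # short lesson for next attempt
    return episode_rewards              # later combined into a single cross-episode return
```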

Step-by-step details:

  1. Set up tasks and attempts
  • What happens: Sample a task (e.g., a Sokoban board, a Minesweeper grid, a shopping goal). Decide a small number of episodes N (e.g., 3) to allow explore-then-exploit.
  • Why this step exists: Multi-attempt structure creates room to explore early and benefit later.
  • Example: Minesweeper 6Ɨ6 with 3 mines; the agent gets 3 chances to clear the board.
  2. Run Episode 1 (explore)
  • What happens: The agent observes the environment and takes actions. In partially observable tasks, it tests uncertain places to reveal structure.
  • Why this step exists: Early probing gathers clues that can’t be known without trying.
  • Example: Click a safe-looking corner in Minesweeper to open space and numbers, even if it doesn’t immediately win.
  3. Write self-reflection
  • What happens: After the episode, the agent summarizes mistakes, patterns noticed, and a refined plan for the next episode. This text is added to the prompt for Episode 2.
  • Why this step exists: Converts experience into a compact plan, enabling in-context adaptation.
  • Example: ā€œI clicked near a ā€˜2’ without deducing both mines—next time, mark those two and open the others.ā€
  4. Run Episode 2 (guided exploitation)
  • What happens: The agent now acts using the reflection as memory, focusing on high-value actions that use the newly learned info.
  • Why this step exists: Capitalizes on exploration by following the improved plan.
  • Example: Search narrower terms in Webshop or push the right box first in Sokoban.
  5. Optional: Episode 3
  • What happens: Repeat reflection and adaptation once more if budget allows.
  • Why this step exists: A second cycle can fix remaining gaps and lock in success.
  • Example: In Minesweeper, after deducing two mines, safely open the remaining cells.
  6. Cross-episode credit assignment
  • What happens: The training objective sums rewards across episodes so that Episode 1 gets credit if it helps Episode 2 succeed.
  • Why this step exists: Incentivizes exploring early, exploiting later.
  • Example: A risky but informative click in Episode 1 is rewarded when Episode 2 wins.

šŸž Hook: Think of a dimmer switch that sets how much you care about later outcomes versus immediate ones.

🄬 The Concept: Cross-Episode Discount Factor (γ_traj)

  • What it is: A dial that controls how much later-episode rewards count back to earlier episodes.
  • How it works:
    1. Small γ_traj: favor quick exploitation; less incentive to explore early.
    2. Medium γ_traj: balanced; explore enough, then exploit.
    3. Large γ_traj: heavy emphasis on long-term payoff; more exploration.
  • Why it matters: Without this dial, you can’t tune exploration vs. exploitation per environment. šŸž Anchor: In Minesweeper, higher γ_traj helps—exploring numbers and patterns first pays off later.
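One plausible way to implement this dial is a discounted return-to-go across episodes, sketched below; the paper's exact return definition may differ.

```python
def cross_episode_returns(episode_rewards, gamma_traj=0.6):
    """Credit episode i with its own reward plus the discounted rewards
    of every later episode in the same trial."""
    returns, running = [0.0] * len(episode_rewards), 0.0
    for i in reversed(range(len(episode_rewards))):
        running = episode_rewards[i] + gamma_traj * running
        returns[i] = running
    return returns

# Example: an exploratory first episode that fails (reward 0) but enables a later win
# still receives credit: cross_episode_returns([0.0, 0.0, 1.0]) -> [0.36, 0.6, 1.0]
```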
  7. Optimization with standard RL
  • What happens: Use policy-gradient-style updates (e.g., PPO, GRPO, GiGPO) to align the agent with the cross-episode objective. The reflection step itself is trained, since its quality affects the next episode’s reward.
  • Why this step exists: Provides a practical, scalable way to learn the strategy (a minimal loss sketch appears after this list).
  • Example: The agent that writes clearer reflections does better in Episode 2 and gets reinforced.
  8. Memory configuration
  • What happens: The context can include (a) trajectory history, (b) reflection notes, or (c) both. Empirically, reflection-only often worked best: concise guidance beats long logs.
  • Why this step exists: Keeps context focused and token costs lower.
  • Example: ā€œAvoid pushing left box into corner; route around wall firstā€ is more useful than pages of raw moves.
  9. Compute budget and efficiency
  • What happens: RL and Meta-RL use the same number of total trajectories per update, but Meta-RL episodes within a trial must run sequentially (less parallelism). In current code, this roughly doubles wall-clock time.
  • Why this step exists: Ensures fair data usage while recognizing training-time trade-offs.
  • Example: Asynchronous rollouts could recover parallelism later.
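To connect the cross-episode credit of step 6 with the optimizer of step 7, here is a minimal REINFORCE-style sketch that uses discounted cross-episode returns as advantages. LAMER itself builds on PPO/GRPO-style updates; this simplified loss is only meant to show where the cross-episode signal enters.

```python
def reinforce_loss(episode_logprobs, episode_rewards, gamma_traj=0.6, baseline=0.0):
    """Minimal REINFORCE-style loss over one trial.
    episode_logprobs[i] is the summed log-probability of the actions in episode i."""
    # Spread credit backwards across episodes (same computation as the earlier sketch).
    returns, running = [0.0] * len(episode_rewards), 0.0
    for i in reversed(range(len(episode_rewards))):
        running = episode_rewards[i] + gamma_traj * running
        returns[i] = running
    # Push up log-probs of episodes whose cross-episode return beats the baseline.
    return -sum(lp * (ret - baseline) for lp, ret in zip(episode_logprobs, returns))
```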

The secret sauce:

  • Cross-episode return: It makes the agent value information-gathering that only pays off later.
  • Trained self-reflection: It turns experience into a sharp, portable plan—exactly what LLMs excel at in-context.
  • Preserved diversity: Unlike standard RL that collapses to a few patterns, Meta-RL keeps useful variety, which fuels better exploration.

šŸž Hook: Picture a scoreboard that counts how many wins you get if you’re allowed 1, 2, or 3 tries.

🄬 The Concept: pass@k (Test-Time Scaling)

  • What it is: The chance of success if the agent is allowed k attempts.
  • How it works:
    1. Run the agent once: pass@1.
    2. If it fails, try again with reflection: pass@2.
    3. Try again: pass@3.
  • Why it matters: If pass@k grows a lot with more tries, the agent is learning from its own attempts. šŸž Anchor: LAMER’s pass@3 jumps much more than RL’s, showing real test-time learning.
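For sequential attempts with reflection, pass@k can be measured empirically as the fraction of tasks solved within the first k tries, as in this illustrative sketch (not the paper's evaluation script):

```python
def pass_at_k(per_task_attempt_outcomes, k):
    """per_task_attempt_outcomes: one list of booleans per task, ordered by attempt.
    Returns the fraction of tasks solved within the first k attempts."""
    solved = sum(any(outcomes[:k]) for outcomes in per_task_attempt_outcomes)
    return solved / len(per_task_attempt_outcomes)

# Example: 3 tasks, 3 attempts each
outcomes = [[False, True, True], [False, False, False], [True, True, True]]
print(pass_at_k(outcomes, 1))  # ~0.33 -> only the third task succeeds on attempt 1
print(pass_at_k(outcomes, 3))  # ~0.67 -> two of three tasks succeed within 3 attempts
```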

šŸž Hook: Imagine counting how many different paths people take through a maze to see who’s really exploring.

🄬 The Concept: Trajectory Diversity (Entropy)

  • What it is: A measure of how varied the agent’s action sequences are across runs.
  • How it works:
    1. Sample many trajectories.
    2. Group identical ones and compute their probabilities.
    3. Higher entropy = more diversity.
  • Why it matters: Without diversity, exploration is weak; with too much random diversity, success drops. LAMER keeps the right kind. šŸž Anchor: In Minesweeper, LAMER tries multiple reasonable openings instead of one rigid pattern, then locks in a plan with reflection.
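A simple empirical version of this measure groups identical action sequences and computes their Shannon entropy; the sketch below is illustrative and may differ from the paper's exact diversity metric.

```python
import math
from collections import Counter

def trajectory_entropy(trajectories):
    """Empirical Shannon entropy (in bits) of sampled action sequences.
    Each trajectory is a sequence of actions; identical sequences are grouped."""
    counts = Counter(tuple(t) for t in trajectories)
    n = len(trajectories)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Example: four runs with three distinct openings -> higher entropy than four identical runs
print(trajectory_entropy([("up", "push"), ("up", "push"), ("left", "push"), ("down", "wait")]))
```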

04 Experiments & Results

The test: The authors measured how often agents succeed under 1, 2, or 3 attempts (pass@1, pass@2, pass@3) on three main environments—Sokoban (planning), Minesweeper (logical deduction with hidden info), and Webshop (goal-driven web navigation)—plus generalization to unseen tasks in ALFWorld.

The competition: LAMER was compared to prompting baselines (Zero-shot, ReAct, Reflexion) and strong RL methods (PPO, RLOO, GRPO, GiGPO). GiGPO is a top-performing RL baseline.

Scoreboard with context:

  • Sokoban: LAMER reached 55.9% pass@3, beating GiGPO’s 44.1% by about 12 points (like moving from a solid B to an A-). It also improved pass@2 and even matched or beat pass@1.
  • Minesweeper: LAMER hit 74.4% pass@3, a 19-point gain over RL’s 55.1% (a big jump from a C+ to a solid B+/A-). While RL could start slightly higher at pass@1, LAMER surged ahead by pass@2 and pass@3.
  • Webshop: LAMER achieved 89.1% pass@3 vs. 75.2% for GiGPO (about +14 points), showing that explore-then-exploit helps even in web navigation and filtering tasks.

Surprising and notable findings:

  • Strong test-time scaling: LAMER’s pass@k grows much faster with each added attempt than RL’s. This shows that the agent truly learns from its own mistakes via reflection and cross-episode credit.
  • Diversity preserved: Standard RL tended to reduce trajectory diversity (becoming deterministic), which made it brittle. LAMER preserved more diversity while still raising success—better exploration without chaos.

Harder tasks generalization:

  • When Sokoban added more boxes or Minesweeper added more mines, all methods dropped in success, as expected. But LAMER consistently outperformed RL at every difficulty, keeping about a 10-point edge at the hardest Sokoban setting and a 5-point edge at the hardest Minesweeper setting. This suggests the learned exploration strategy scales with difficulty.

Out-of-distribution generalization (ALFWorld):

  • Trained on Pick, Look, Clean, Heat (in-distribution), tested also on Cool and Pick2 (out-of-distribution). RL already beat prompting on in-distribution. But LAMER went further: it not only topped RL in-distribution but also delivered big jumps on out-of-distribution tasks (e.g., +23 points on Cool, +14 points on Pick2). This indicates that the exploration habits and reflection-based adaptation transfer to new task variants.

Ablations and controls:

  • Cross-episode discount γ_traj: Best value depends on the environment. Minesweeper liked higher γ_traj (more long-term credit), while Sokoban/Webshop did best around a middle value (0.6). The key is that γ_traj is a practical dial for exploration/exploitation balance.
  • Memory content: Reflection-only memory outperformed both trajectory-only and the combination of both, likely because concise lessons focus the agent better than raw logs.
  • Fair training budget: The authors matched total trajectories per update between RL and Meta-RL to ensure fairness. Meta-RL incurred about 2Ɨ wall-clock time due to sequential episodes, suggesting room for engineering optimizations.

Takeaway: LAMER’s meta-RL objective and trained reflection mechanism let agents explore usefully, adapt quickly at test time, and generalize to harder and novel tasks—outperforming robust RL baselines in multiple domains.

05 Discussion & Limitations

Limitations:

  • Training speed: Because episodes within a trial depend on each other, they must run sequentially, reducing parallelism and increasing wall-clock time (about 2Ɨ in the reported setup).
  • Hyperparameter tuning: The best γ_traj (the exploration vs. exploitation dial) varies by environment, so some tuning is needed.
  • Scope of evaluation: While environments are diverse (games, web, embodied text), broader real-world tests (e.g., complex web ecosystems, longer horizons) would strengthen conclusions.

Required resources:

  • A capable instruction-tuned LLM (e.g., Qwen3-4B, Llama3.1-8B) with stable RL finetuning infrastructure.
  • RL training stack supporting policy gradients over multi-episode trials and reflection generation.
  • Compute budget for sequential trials (or engineering to parallelize asynchronously).

When NOT to use:

  • One-shot tasks with immediate, dense feedback where exploration doesn’t help (e.g., very short, fully observed tasks) may see little benefit from cross-episode structure.
  • Extremely tight test-time latency constraints where there isn’t budget for multiple attempts or reflection writing.
  • Domains where reflection text cannot be safely or effectively included in context (e.g., strict input-format protocols without room for notes).

Open questions:

  • How far can this scale? What happens with 5–10 episodes per trial or much longer horizons?
  • Can verifier models or structured search be combined with LAMER to further improve reflection quality and safety?
  • How to best automate γ_traj tuning—could the agent learn to set it per task family?
  • Can asynchronous or batched trial execution recover parallelism without hurting adaptation quality?
  • What reflection styles (templates, checklists, criticism vs. planning) work best across domains?

06 Conclusion & Future Work

Three-sentence summary: LAMER is a Meta-RL framework that links multiple attempts of the same task into a single learning arc, rewarding explore-then-exploit behavior. It adapts in-context using trained self-reflections, so the agent learns from its own feedback at test time without changing weights. Across games, web tasks, and text-based embodied environments, LAMER beats strong RL baselines, scales better with more attempts, and generalizes to harder and unseen tasks.

Main achievement: Turning exploration from an afterthought into a trained, cross-episode strategy—and making reflection a first-class, optimized component of test-time adaptation.

Future directions:

  • Combine LAMER with advanced verifiers, better advantage estimation, and stronger base models.
  • Engineer asynchronous rollouts to reduce wall-clock cost while preserving cross-episode adaptation.
  • Extend to broader multi-modal environments and longer-horizon tasks, and study automatic tuning of exploration dials like γ_traj.

Why remember this: LAMER shows a principled way to induce exploration in language agents by training them to learn within a few tries. It reframes test-time compute as fast learning—using notes the agent writes to itself—so agents don’t just act; they adapt. That shift makes them more robust, curious, and ready for the messy, changing problems we actually care about.

Practical Applications

  • Smarter shopping bots that try broad searches first, then narrow down to exact specs within a few attempts.
  • Game assistants that probe safe moves early and then execute winning strategies in puzzle games like Minesweeper or Sokoban.
  • Customer support agents that explore multiple troubleshooting paths before committing to a fix.
  • Educational tutors that reflect on a student’s wrong answers and adapt the next explanation accordingly.
  • Web navigation agents that learn website structure in the first try and complete tasks efficiently on the second.
  • Robotic planning via text interfaces where the agent scouts the environment, then follows a refined plan.
  • Data labeling copilots that experiment with labeling rules on a few samples and then apply consistent criteria.
  • Research assistants that test different query formulations before locking onto the most fruitful search path.
  • Process automation scripts that attempt alternative workflows and adopt the best-performing one in subsequent runs.
#Meta-Reinforcement Learning#Language Agents#Exploration vs Exploitation#In-Context Learning#Self-Reflection#Cross-Episode Credit Assignment#Test-Time Compute#Trajectory Diversity#Policy Gradient#Sokoban#Minesweeper#Webshop#ALFWorld#pass@k#RL for LLMs