SPARK: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning
Key Summary
- SPARK is a new way to train AI agents that saves compute by exploring more only at the most important moments.
- Instead of spreading effort evenly across every step, SPARK detects "critical states" using an internal <explore> signal and branches there.
- This dynamic branching turns one linear path into a small tree only when it matters, improving the chance of finding good solutions.
- With the same or fewer tokens, SPARK finds higher-quality training trajectories and learns faster than uniform exploration methods like GRPO.
- Across ALFWorld, ScienceWorld, and WebShop, SPARK achieves higher success rates and stronger generalization to unseen tasks.
- Sample efficiency jumps: SPARK matches or beats strong baselines using a fraction of the training data (e.g., >84% with just 20% data on ALFWorld L0).
- Token use drops because shared prefixes avoid repeating easy, routine steps, concentrating generation on tricky choices.
- A simple three-stage recipe (Root Initialization → Autonomous Branching → Budget Enforcement) makes SPARK practical and predictable.
- A theoretical view explains why replacing single tries at hard decisions with B-try branching multiplies success probabilities.
- SPARK is model- and optimizer-friendly: it plugs into GRPO-style RL with group-normalized advantages without changing the core learner.
Why This Research Matters
SPARK makes long, complicated tasks more reliable and affordable by spending effort only where it truly helps. That means agents that navigate websites, operate tools, or plan multi-step missions can learn faster from fewer examples. Because it reduces token waste, teams can train capable agents even with tight budgets. The method generalizes to unseen tasks better, so agents don't freeze when the world changes. And since SPARK plugs into existing RL pipelines, labs can adopt it without a full system rewrite. In short, it's a practical upgrade that turns raw compute into smarter learning, not just bigger numbers.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you and your friends are doing a long scavenger hunt around town. If you spend equal time checking every lamppost, bench, and bush, you'll waste hours on boring spots and miss the tricky places where clues actually hide.
The Concept (Decision Making): What it is: Decision making is choosing what to do next to reach a goal. How it works: 1) Notice your current situation, 2) Think about possible moves, 3) Pick the move that seems best, 4) See what happens, then repeat. Why it matters: Without careful choices, you get stuck or wander, especially in long tasks. Anchor: When you cook a recipe, you decide when to chop, boil, or taste. Good decisions make dinner delicious; bad ones burn the soup.
Hook: You know how a coach helps a team improve by trying plays, seeing results, and trying again? That's learning from experience.
The Concept (Reinforcement Learning, RL): What it is: RL teaches an AI to make better decisions by trying actions and getting rewards for success. How it works: 1) Try an action, 2) Get feedback (reward), 3) Update the strategy to prefer rewarding actions, 4) Repeat over many steps. Why it matters: Without RL, the agent can't improve from practice, especially in long, multi-step problems. Anchor: A robot that learns to tidy a room gets points for putting items in the right place and learns which moves work best next time.
Hook: If you don't see the whole maze, you must guess the right turns from what you can see.
The Concept (Partially Observable Markov Decision Process, POMDP): What it is: A POMDP is a plan-for-the-future setup for when you can't see the whole world, only partial observations. How it works: 1) Observe part of the world, 2) Keep a memory of what happened (history), 3) Choose an action, 4) See the new observation, 5) Continue until the goal. Why it matters: Without handling partial info, the agent panics or guesses wildly and fails long missions. Anchor: Text games and websites only show one page or room at a time, so the agent must remember past steps to choose wise next steps.
Hook: When studying, you don't spend equal time on every page; you focus on the hard parts.
The Concept (Exploration Strategy): What it is: An exploration strategy is how an AI decides where to try new things to learn fastest. How it works: 1) Notice uncertainty, 2) Try alternative actions where it helps most, 3) Compare results, 4) Prefer the best path. Why it matters: Without a smart exploration plan, the agent wastes time on easy steps and ignores tough, important choices. Anchor: If a math problem's last step confuses you, you test a few possible methods there instead of rereading the obvious first step.
Hook: If your team has only 1 hour to practice, you don't split it evenly between warm-ups and tricky plays.
The Concept (Resource Allocation): What it is: Resource allocation is deciding where to spend limited compute, tokens, or time. How it works: 1) Set a budget, 2) Spot where extra effort pays off, 3) Spend more there, less elsewhere, 4) Stop when the budget ends. Why it matters: Without smart allocation, you burn your budget on routine moves and miss key improvements. Anchor: In a test, you give more time to the hardest questions, not the easy ones.
Hook: Suppose every time the path gets confusing, you momentarily try a couple of turns in parallel before picking the best.
The Concept (Dynamic Branching): What it is: Dynamic branching means the agent creates extra "what-if" paths only at critical moments. How it works: 1) Walk forward normally, 2) When uncertainty is high, split into a few branches, 3) Evaluate outcomes, 4) Continue with the best. Why it matters: Without branching at hard spots, you miss promising solutions; branching everywhere explodes cost. Anchor: In a maze, you test two possible forks only when the corridor splits, not at every straight hallway.
Hook: Your brain sometimes says, "Hmm, I'm not sure; let me look harder."
The Concept (Intrinsic Decision-Making Signals): What it is: These are self-generated signals (like an <explore> tag) that tell the agent, "This spot is uncertain; try more options." How it works: 1) The agent reasons in text, 2) If the reasoning detects ambiguity, it emits <explore>, 3) The system branches there, 4) Learning prefers branches that succeed. Why it matters: Without internal signals, you need handcrafted rules or wasteful blind search. Anchor: While shopping online, if details are unclear, you open two tabs to compare before buying.
The world before SPARK: RL helped language-based agents in math and coding, but long missions (like household tasks or web shopping) were tough. Good full-length success stories (trajectories) were rare, and budgets were tight. Many methods simply rolled out many attempts, spending equal effort at each step, which meant tons of compute on trivial moves (like "open fridge") and too little on pivotal choices (like "what do I try when the expected ingredient is missing?").
The problem: With uniform exploration, agents burned tokens on routine steps and still missed the few critical moments that decide success.
Failed attempts: Bigger rollouts and search tried to cover more ground, but they still allocated compute uniformly, didn't guarantee useful samples, and didn't generalize robustly beyond seen tasks.
The gap: Agents needed a way to sense "this decision is hard" and then focus exploration there, without humans hand-marking such spots.
Real stakes: Better resource use means faster training, cheaper runs, and agents that don't fall apart when the world changes. That helps robots, assistants, and web agents handle long, messy tasks more reliably.
02 Core Idea
Hook: Imagine a choose-your-own-adventure book where you only branch when the story reaches a cliffhanger, not on every boring page.
The Concept (Strategic Exploration): What it is: Strategic exploration means spending extra effort only where it greatly boosts your chance to finish the task. How it works: 1) Sense uncertainty at a step, 2) Open a few alternative continuations there, 3) Compare the outcomes, 4) Keep going with the best path. Why it matters: Without this, agents either branch everywhere (too expensive) or almost nowhere (miss solutions). Anchor: When building LEGO, you try a couple of tricky connections only at hard joints, not after every block.
The "Aha!" moment in one sentence: Use the agent's own reasoning to detect critical states and trigger small, budgeted branching only there, so compute moves from routine steps to decision bottlenecks.
Three analogies:
- Hiking guide: You walk normally on clear trails but pause at confusing forks to scout two short paths before choosing one.
- Studying smart: You breeze through easy problems but try multiple solution ideas only on the hardest ones.
- Detective work: You don't recheck every alibi; you dig deeper only into clues with contradictions.
Before vs. after:
- Before (Uniform): The agent samples evenly at all steps, creating many near-duplicates and missing turning points.
- After (SPARK): The agent continues linearly through routine states and branches only at "SPARK points" (moments flagged by an internal <explore> signal), raising the chance of landing on a successful trajectory with the same or fewer tokens.
The Concept (Action Selection): What it is: Choosing which action to take next from the allowed options. How it works: 1) Score options with your policy, 2) Sample or pick the best, 3) Observe the result. Why it matters: Wrong actions at key steps domino into long failures. Anchor: Picking the right subway line at a hub station makes the entire trip work; a mistake there ruins the route.
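To make the "score, then sample or pick" loop concrete, here is a toy sketch in Python; the action names and scores are invented for illustration and are not taken from the paper:

```python
import math
import random

# Toy action selection: score the allowed options with a policy, then either
# sample (exploratory) or take the argmax (greedy). Names and scores are
# illustrative assumptions, not values from the paper.
actions = ["go to fridge", "go to sink", "open drawer 1"]
scores = [2.0, 1.2, 0.4]  # hypothetical policy scores, higher = preferred

# Softmax turns raw scores into a probability distribution over actions.
exp_scores = [math.exp(s) for s in scores]
total = sum(exp_scores)
probs = [e / total for e in exp_scores]

sampled_action = random.choices(actions, weights=probs, k=1)[0]  # stochastic pick
greedy_action = actions[scores.index(max(scores))]               # deterministic pick
print(probs, sampled_action, greedy_action)
```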
Why it works (intuition without math):
- Success in long tasks often hinges on a few pivotal choices. If you try just one action at each critical step, the odds of getting them all right multiply to a small number. But if you branch into B tries at each critical step, your chance of getting at least one good action there jumps a lot. Multiply those boosts across a handful of critical steps, and your total success rate rises sharply without needing more full trajectories (a toy calculation follows this list).
- Because branches share long prefixes, you don't regenerate tokens for easy early steps repeatedly; you reuse context and spend generation where it counts.
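Here is that toy calculation. The per-step success probability p, the number of critical steps k, and the branching factor B below are illustrative placeholders, not numbers from the paper:

```python
# Toy illustration of the branching intuition: p, k, and B are assumptions
# chosen for readability, not measurements from the paper.

def success_single_try(p: float, k: int) -> float:
    # One try per critical step: all k pivotal choices must go right.
    return p ** k

def success_with_branching(p: float, k: int, B: int) -> float:
    # B tries per critical step: at least one of the B attempts must go right.
    per_step = 1 - (1 - p) ** B
    return per_step ** k

p, k, B = 0.5, 4, 2
print(success_single_try(p, k))         # 0.5**4  = 0.0625
print(success_with_branching(p, k, B))  # 0.75**4 ~ 0.316
```

In this toy setting, doubling the tries at each of four pivotal choices lifts the end-to-end success chance from about 6% to about 32%, which is the multiplicative effect described above.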
Building blocks:
- SPARK points: Critical decision states identified by the agent's internal <explore> tag in its reasoning trace.
- Dynamic branching: Create B child continuations only at SPARK points; keep b=1 (no split) at routine steps.
- Budget awareness: A global leaf budget N ensures you never overspend; branching factors are clipped to stay within N.
- Root diversity: Start with M distinct roots so you don't put all your eggs in one starting basket.
- Tree-based update with GRPO-style training: Group completed leaves from the same task, propagate terminal rewards (success/failure), and use group-normalized advantages so branches that share prefixes can be compared fairly.
- Lightweight SFT: A small supervised warm-up teaches the model to emit the <explore> signal; RL then refines when to branch and which branches pay off.
Anchor: In ALFWorld, if the fridge and table have no eggs, <explore> triggers branching to check sink or drawers. Only the uncertain step gets extra tries, speeding success without wasting effort on obvious moves like "walk to microwave."
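A minimal sketch of that trigger, assuming the signal appears as a literal <explore> tag in the model's reasoning text (the tag name follows the paper; the helper name and example traces are invented for illustration):

```python
# Minimal sketch of a SPARK-point trigger: branch only when the reasoning
# trace contains the intrinsic <explore> tag. The helper name and the example
# traces are illustrative assumptions, not the paper's implementation.

def branching_factor(reasoning_trace: str, B: int = 2) -> int:
    """Return b: B children at a critical (SPARK) state, 1 at a routine step."""
    return B if "<explore>" in reasoning_trace else 1

print(branching_factor("Fridge and table have no eggs. <explore> Sink or drawers?"))  # 2
print(branching_factor("The task is clear. Walk to the microwave."))                  # 1
```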
03 Methodology
At a high level: Input (task + history) → Step A: Root Initialization → Step B: Autonomous Branching at SPARK points → Step C: Budget Enforcement → Output: A small tree of trajectories used for a GRPO-style policy update.
Hook: Think of training as planting a few seed paths, letting them sprout extra branches only where the soil is tricky, and pruning the tree so it fits the pot.
The Concept (Trajectory): What it is: A trajectory is the full story of observations, thoughts, and actions from start to finish. How it works: 1) See observation, 2) Think (reasoning trace), 3) Act, 4) Repeat, until success/failure. Why it matters: Learning uses these stories to figure out which decisions lead to wins. Anchor: A cooking log listing each step you took from preheating to serving is a trajectory of the meal.
Step-by-step recipe:
- Root Initialization (diversity first)
- What happens: From the initial query or state, the policy samples M different reasoning+action starts, creating M roots. This captures different plausible openings when the task is most ambiguous.
- Why it exists: If you only start from one root, you might get stuck exploring the same neighborhood and miss better starts.
- Example: For "make breakfast," roots could start with "check fridge," "check pantry," "search countertop," or "review utensils." In practice, M=4.
- Autonomous Branching (branch only when it matters)
- What happens: At each step of each active trajectory, the model writes a short reasoning trace. If the trace includes the <explore> tag (its intrinsic uncertainty signal), the branching criterion B(·) sets b=B (e.g., B=2). Otherwise, b=1 and the path continues linearly. The system samples b children (reasoning+action) from the updated context.
- Why it exists: Extra tries at routine steps are wasteful; extra tries at hard steps are gold. The <explore> tag turns the agent's own doubt into a precise branching trigger.
- Example: If expected ingredients are missing, <explore> might produce two child plans: try the sink area vs. try the drawers.
- Budget Enforcement (never overspend)
- What happens: There's a global cap N on total active leaves. When a node asks for b children, we compute an effective branching factor b_eff = min(b, N - N_current + 1) so the tree never grows beyond N.
- Why it exists: Without a strict budget, branching could explode and erase the compute savings.
- Example: With N=8, once all 8 leaf slots are active, a request for 2 children is clipped to b_eff=1, so the path continues without splitting.
- Termination and Grouping (collect the leaves)
- What happens: Paths end on success, failure, or max steps. All completed leaves from the same task form a group G. Their shared prefixes create natural comparisons between alternative decisions at the exact same points.
- Why it exists: Grouping lets us evaluate sibling branches fairly and assign credit to the better choices.
- Example: Two branches that only differ at "where to search next" show directly which choice led to a find.
- Tree-Based Policy Update (GRPO-friendly)
- What happens: Assign binary terminal rewards (1 for success, 0 for failure) to each completed leaf and propagate them along the path. Use group-normalized advantages so sibling branches become a built-in "A/B test" at each differing step. Optimize with a clipped objective and a KL penalty to a reference policy, just like GRPO pipelines.
- Why it exists: This plugs into existing critic-free RL setups while leveraging the decision-focused comparisons that the tree provides.
- Example: If the "search sink" branch wins and "recheck fridge" loses, the advantage favors the winning action at that fork.
The Concept (Budget): What it is: A limit on how many branches or tokens you can generate. How it works: 1) Set N (total leaf budget), set M (roots), set B (branching factor), 2) Use b_eff to clip growth, 3) Stop generation when N is filled or episodes end. Why it matters: Budgets force the method to be efficient; otherwise costs blow up. Anchor: Like having 8 tickets for rides; you choose where to use them for the most fun.
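The clipping rule fits in a few lines. Below is a simplified sketch of Step C under the stated formula b_eff = min(b, N - N_current + 1), where N_current is taken here to be the current number of active leaves; the function and variable names, and the walkthrough values, are illustrative assumptions:

```python
# Simplified sketch of budget enforcement (Step C). Branching replaces one leaf
# with b children, so the leaf count grows by b - 1; clipping b at
# N - N_current + 1 keeps the tree at or under the global budget N.

def effective_branching(b_requested: int, n_current: int, n_budget: int) -> int:
    return min(b_requested, n_budget - n_current + 1)

N = 8        # global leaf budget
leaves = 4   # after root initialization with M = 4 roots

b_eff = effective_branching(2, leaves, N)  # min(2, 8 - 4 + 1) = 2
leaves += b_eff - 1                        # one leaf becomes two -> 5 active leaves
print(b_eff, leaves)

print(effective_branching(2, 8, N))        # budget full: min(2, 1) = 1, no split
```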
A tiny run-through with actual numbers:
- Defaults: N=8, M=4, B=2.
- Start: Create 4 roots. Now, at step t, two paths show <explore>; each requests b=2. With 4 leaf slots still free under the budget, both get their 2 children. If later the budget is already full (8 active leaves), a new b=2 request is clipped to b_eff=1 and that path continues without splitting.
- Outcome: You finish with up to 8 completed (or terminated) leaves across the small trees.
- Update: The successful leaves get reward 1, failures get 0. Group-relative advantages tell the policy which choices to keep.
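A minimal sketch of the group-relative update signal from the list above. The normalization (reward minus the group mean, divided by the group standard deviation) follows common GRPO practice and is an assumption here, as are the example rewards:

```python
import statistics

# Sketch of group-normalized advantages for one task's group of completed
# leaves: reward 1 for success, 0 for failure. The exact normalization and the
# example rewards are assumptions, not quoted from the paper.
rewards = [1, 0, 1, 0, 0, 1, 0, 0]  # hypothetical outcomes of 8 leaves

mean = statistics.mean(rewards)
std = statistics.pstdev(rewards) or 1.0  # avoid division by zero if all rewards match

advantages = [(r - mean) / std for r in rewards]
# Winning leaves get positive advantages and losing leaves negative, so sibling
# branches that differ only at one fork act as a built-in A/B test at that fork.
print([round(a, 2) for a in advantages])
```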
Secret sauce:
- Intrinsic <explore> tag bridges thinking and acting: the model's uncertainty in its own words triggers branching exactly where it is needed; no human heuristics.
- Shared prefixes compress tokens: Many branches reuse the same early context, cutting repeated generation on easy steps.
- Grouped leaves = built-in experiments: Sibling branches make fair, apples-to-apples comparisons at critical decisions.
Anchor: Picture a homework session where you write one shared intro, branch into two middle paragraphs only if the argument is unclear, then keep the better one. You save time while improving quality.
04 Experiments & Results
Hook: If two basketball teams get the same number of shots, the one that takes more smart shots usually wins.
The Concept (Evaluation/Test): What it is: A fair way to check if an idea works: compare success on shared tasks. How it works: 1) Pick benchmarks, 2) Compare to strong baselines, 3) Measure success rates and scores, 4) Analyze efficiency and generalization. Why it matters: Without fair tests, you can't tell if the improvement is real or just luck. Anchor: Science fairs judge all projects with the same rubric so results are comparable.
The test beds:
- ALFWorld (embodied household tasks), ScienceWorld (scientific procedures with long horizons), WebShop (web navigation and shopping at scale). Splits L0/L1/L2 test seen vs. unseen categories and instances.
The competition:
- Prompting: ReAct.
- RL: GRPO, ETO, GiGPO, RLVMR.
- Closed-source LLMs: GPT-4o, GPT-5-mini, GPT-5, Gemini-2.5-Pro.
The scoreboard (with context):
- Overall wins: SPARK outperforms all baselines across ALFWorld, ScienceWorld, and WebShop with both 1.5B and 7B backbones. On tough ScienceWorld L2, SPARK reaches 49.2% (1.5B) and 57.8% (7B), beating strong RL baselines by large margins. On ALFWorld L2, SPARK-1.5B hits 80.5% vs. GRPO's 29.7% (about 2.7× as high). On WebShop, SPARK improves both score and success.
- Like grades: Where some baselines hover around a C or B-, SPARK jumps to an A-/A in many settings, especially on unseen tasks.
Sample efficiency (learning with fewer examples):
- On ALFWorld L0 with only 20% of the training data, SPARK already achieves 84.4% success, exceeding GRPO trained on 100% of the data (76.6%). With 40% of the data, SPARK matches or beats strong baselines trained on the full dataset.
- On ScienceWorld L0, SPARK leads consistently across data scales.
- Translation: It's like learning the semester's material by studying only the most important 40% of notes and still acing the test.
Token efficiency (doing more with fewer tokens):
- Relative to chain-like rollouts normalized to 100, SPARK uses about 93.1 (ALFWorld), 53.0 (ScienceWorld), and 88.8 (WebShop). That's up to 47% fewer tokens because shared prefixes avoid redoing routine steps.
Inference scalability (more tries at test time):
- With Pass@16 (generate 32 candidates, evaluate best 16), SPARK achieves 94.9% on ScienceWorld, about 30 points above the strongest baseline, showing that focusing on higher-quality samples pays off even when you scale inference.
Surprising findings:
- Small wins big: SPARK-1.5B beats large closed models on some hard splits (e.g., ScienceWorld L2 vs. GPT-5/Gemini-2.5-Pro), proving strategy can trump raw size.
- The harder the task, the bigger the gap: Gains are largest on multi-step, coordination-heavy tasks (e.g., ALFWorld Pick2), exactly where selective branching helps most.
- Fewer loops: Repetitive action ratios drop notably (e.g., ALFWorld L2: 15.4% vs. GRPO's 27.1%), confirming more purposeful exploration.
Anchor: It's like practicing the piano: focusing repetitions on the tricky bar instead of replaying the easy intro makes you performance-ready faster.
05 Discussion & Limitations
Hook: Even the best strategy has places it shines and places it struggles, like running shoes that are amazing on a track but not on a rocky trail.
The Concept (Limitation): What it is: A clear boundary of where a method may underperform. How it works: 1) Identify tough scenarios, 2) Explain why stress appears, 3) Suggest remedies. Why it matters: Knowing limits helps you use tools wisely and improve them next time. Anchor: You wouldn't use a paintbrush to hammer nails; tools have proper uses.
Limitations:
- Very low-capability base models may not emit reliable <explore> signals, so SPARK can miss true critical states or branch at routine steps.
- If a task has many critical decisions densely spread across the trajectory, selective branching brings less advantage over uniform exploration.
- SPARK still needs a small SFT warm-up to bootstrap the <explore> skill.
Required resources:
- Compute for group rollouts (e.g., N=8 leaves per task), a short SFT warm-up (about 300 trajectories), and standard RL training (GRPO-style). 4×A100-80GB GPUs were used in the experiments.
When NOT to use:
- Very short tasks with trivial decisions; uniform exploration is already cheap.
- Domains with perfect, dense intermediate rewards and easy verification, where standard uniform tree search or dense process rewards already work well.
- Extremely uncertain environments where nearly every step is critical; the budget may still be stretched thin.
Open questions:
- How to auto-calibrate <explore> for weaker models (e.g., learned uncertainty estimators or combining internal and external signals)?
- Can branching factor B and root count M be adapted online per task to squeeze more value from a fixed N?
- How to extend to multimodal, tool-rich settings where critical states depend on vision, APIs, or memory retrievals?
- Can we couple SPARK with process reward models without losing its budget discipline?
Anchor: Think of SPARK as a smart flashlight: it's fantastic for spotlighting tricky corners in long, dark hallways, but you still need a good battery (base model) and enough beams (budget) to cover a mansion.
06 Conclusion & Future Work
Three-sentence summary: SPARK trains long-horizon agents to branch only at critical decision points detected by an intrinsic <explore> signal, reallocating compute from easy steps to hard ones. This dynamic, budgeted tree of trajectories yields higher-quality samples, better success rates, fewer tokens, and stronger generalization than uniform exploration. It plugs into GRPO-style RL without redesigning the learner, making it practical and efficient.
Main achievement: Turning internal reasoning signals into precise, compute-aware branching so exploration becomes strategic, not blind.
Future directions: Auto-calibrate <explore> for weaker models, adapt B and M per task, blend intrinsic signals with light external feedback, and scale to multimodal/tool-using agents while preserving budget efficiency.
Why remember this: In long tasks, a few choices decide everything; SPARK finds and invests in those moments, proving that "where you look" can matter more than "how big you are."
Practical Applications
- Household robots that branch only when unsure where an item is, speeding up tidy-up missions.
- Customer-service chatbots that explore alternate troubleshooting steps only when initial fixes fail.
- Web shopping agents that try multiple product paths at confusing filters, not at every click.
- Code assistants that branch on tricky bug fixes while skipping routine refactors.
- Data pipeline orchestrators that explore fallback data sources only when the primary is missing.
- Scientific lab assistants that try alternative experiment steps when measurements look off.
- Education tutors that test a couple of new hints only when a student is stuck on a key concept.
- Tool-using agents that branch between APIs only when the current tool gives ambiguous results.
- Game-playing agents that explore multiple tactics at pivotal turns rather than each move.
- Workflow schedulers that branch contingency plans only when deadlines or resources become uncertain.