DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
Key Summary
- DeepPlanning is a new benchmark that tests whether AI can make long, realistic plans that fit time and money limits.
- It covers two tough, real-world jobs: multi-day travel planning and multi-product shopping with coupons and shipping times.
- The benchmark forces AIs to look up facts with tools, follow local rules (like store hours), and also keep the whole plan under global limits (like total budget).
- Everything runs in offline sandboxes with fixed databases and Python tools, so results are reproducible and easy to check by code.
- Models are scored by commonsense plan quality and whether they meet personal user needs, plus strict case accuracy that requires perfection.
- Even the strongest models often fail to make a fully correct plan end-to-end; the best travel case accuracy is only about 35%.
- Reasoning-enabled models do better than non-reasoning ones, and careful tool use boosts performance.
- More tool calls generally mean better plans, but there’s a trade-off between performance and cost (time/turns).
- Parallel tool use and reliable step-by-step reasoning patterns help balance effectiveness and efficiency.
- Error analysis shows three common weak spots: missing key info, breaking hidden real-world rules, and failing to keep the whole plan consistent.
Why This Research Matters
Real life is full of plans that must work from start to finish—vacations, class schedules, shopping budgets, and deliveries. DeepPlanning tests whether AI can handle that real-world complexity, not just answer a small sub-question. By forcing tools-based fact checking, local rule-following, and whole-plan optimization, it pushes AI toward reliability we can trust. This matters for consumers (cheaper, hassle-free shopping), travelers (no missed flights due to timing errors), and businesses (fewer costly planning mistakes). It also guides researchers to build agents that can backtrack, verify, and balance performance with cost. In short, it’s a stepping stone to AI assistants that can truly plan like careful humans.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine planning a week-long class trip. You can’t just pick a bus and a lunch spot. You must check opening hours, ticket prices, travel time between places, and the total cost so you don’t run out of money by day three.
🥬 Filling (Concept 1: Agentic Planning)
- What it is: Agentic planning is when an AI doesn’t just answer questions—it uses tools (like search or databases) to act step by step and build a plan.
- How it works:
- Read the user’s goal.
- Call tools to fetch real data (flights, hotels, products).
- Combine the results into a plan.
- Check if the plan fits rules (budget, timing).
- Why it matters: Without tool use, the AI might guess and get facts wrong (like a flight time), breaking the whole plan.
🍞 Bottom Bread (Anchor): Like a student planning a field trip by checking bus schedules and museum hours online, then making a full-day schedule that actually works.
🥬 Filling (Concept 2: Long-Horizon Planning)
- What it is: Long-horizon planning means making decisions that stretch over many steps or days, where early choices affect later ones.
- How it works:
- Set overall goals (e.g., 7-day trip under $X).
- Break into days and steps.
- Make choices that don’t ruin later parts (like avoiding an attraction closed on the only free afternoon).
- Keep checking the whole plan as you go.
- Why it matters: If you only think one step ahead, you’ll accidentally cause conflicts later (like overlapping events or overspending).
🍞 Bottom Bread (Anchor): Planning a science fair build that needs parts ordered early so the final demo can be ready on time.
🥬 Filling (Concept 3: Global Constrained Optimization)
- What it is: This is picking the best whole-plan solution while respecting big-picture limits (total time, total budget, cross-day dependencies).
- How it works:
- Add up all costs and times across all days.
- Check them against trip-level limits.
- Adjust choices to keep the plan valid and as good as possible.
- Re-check until everything fits.
- Why it matters: If you only fix small pieces, the entire plan can still break (e.g., perfect lunches but over-budget trip).
🍞 Bottom Bread (Anchor): Choosing a combo of soccer practices, homework time, and family dinners so the weekly schedule fits and nothing overlaps.
The World Before: Many AI “tool-use” tests focused on tiny tasks, like picking a hotel by amenities, not on completing a full trip under one budget. That meant AIs could look smart locally but still fail globally.
The Problem: Real life needs plans that work from start to finish. You need both local rule-following (like opening hours) and global rule-following (like not exceeding the total budget), plus the skill to go find missing info.
Failed Attempts: Past benchmarks often had weak global limits (easy budgets), trivial local rules, or ignored the messy step of gathering info. Others were too abstract to resemble real-life planning.
The Gap: We lacked a realistic, checkable way to test whether an AI can gather info, obey fine-grained rules, and still hit big-picture targets across many steps.
🥬 Filling (Concept 4: DeepPlanning)
- What it is: DeepPlanning is a benchmark that tests whether AIs can do long, realistic planning with both local and global constraints, using tools to fetch real data.
- How it works:
- Two domains: multi-day travel and multi-product shopping.
- Offline sandboxes with fixed databases and Python tools.
- Layered tasks that add personal and environmental constraints.
- Code-based checkers score plans for realism and rule-following.
- Why it matters: Without a rigorous, realistic test, we can’t trust that AIs will plan correctly in the real world.
🍞 Bottom Bread (Anchor): It’s like a practice league for planners: same fields, same rules, and referees that check every detail fairly.
Real Stakes: If your vacation plan overlaps times, you miss your flight. If your shopping cart ignores coupon rules, you overpay. A good benchmark helps build agents we can rely on for travel, shopping, schedules, and more.
02 Core Idea
🍞 Top Bread (Hook): You know how a great coach doesn’t just pick the next play—they map out the whole game and the whole season so everything works together? That’s the kind of planning this paper wants AIs to master.
🥬 Filling (Concept 5: The “Aha!” of DeepPlanning)
- What it is: The key idea is to test AI planning as a whole journey—forcing the AI to gather facts with tools, satisfy small rules, and still meet big trip-wide limits—inside a sandbox where everything is checkable by code.
- How it works:
- Build realistic tasks (travel, shopping) from real/synthetic data.
- Require proactive information acquisition via tools.
- Enforce local constraints (opening hours, sizes, ratings).
- Enforce global constraints (total time, total budget, cross-item coupons).
- Verify with rule-based checkers for objective scoring.
- Why it matters: If the AI can do each piece but can’t keep the whole plan correct, it still fails. This benchmark measures true planning.
🍞 Bottom Bread (Anchor): Like grading a school project not only on the poster and the speech but also on finishing on deadline and under the materials budget.
Multiple Analogies for the Same Idea:
- Puzzle Analogy: It’s not enough to fit a few pieces; the final picture must be complete and correct.
- Cooking-Week Analogy: Buy groceries (info), follow recipes (local rules), and make sure the total food bill and cooking time for the week stay within limits (global constraints).
- Sports Tournament Analogy: Win individual matches (subtasks) and also manage player stamina and total schedule so you can win the whole tournament (global plan).
Before vs After:
- Before: Benchmarks checked local skills—filter a hotel, pick a single flight.
- After: DeepPlanning checks if the whole plan stands up—day-by-day timing, moving between places, and final budget math.
🥬 Filling (Concept 6: Proactive Information Acquisition)
- What it is: The AI must look up missing facts instead of guessing.
- How it works:
- Detect info gaps (e.g., attraction coordinates, stock, coupons).
- Call the correct tool to fetch them.
- Use those facts in the plan.
- Re-query if conflicts appear.
- Why it matters: Guessing leads to fake times or prices, breaking the plan.
🍞 Bottom Bread (Anchor): Checking a bus timetable before planning when to leave for the museum.
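To make this concrete, here is a minimal Python sketch of an agent step that fetches a missing coordinate instead of guessing it. The tool name search_location comes from the paper's travel sandbox (it appears again in the Methodology recipe below), but its call signature and return fields are assumptions made purely for illustration.

```python
def ensure_coordinates(attraction_name, known_facts, search_location):
    """Fetch coordinates only when they are missing, never guess them.

    `search_location` stands in for the sandbox tool of that name;
    its arguments and return format here are assumed for illustration.
    """
    key = ("coords", attraction_name)
    if key not in known_facts:                          # detect the information gap
        result = search_location(name=attraction_name)  # proactive tool call
        known_facts[key] = (result["lat"], result["lon"])
    return known_facts[key]                             # reuse the verified fact later
```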
🥬 Filling (Concept 7: Local Constrained Reasoning)
- What it is: Obey rules inside each step—like opening hours, sizes, brand filters, or meal durations.
- How it works:
- Read the rule (e.g., dinner 1–2 hours; attraction within open hours).
- Apply it to your chosen item.
- Discard items that fail.
- Keep only valid choices.
- Why it matters: One broken step (closed attraction) can derail the day.
🍞 Bottom Bread (Anchor): Don’t schedule lunch at 3 a.m. if the restaurant closes at 9 p.m.
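A minimal sketch of the filtering rule described above, assuming each candidate carries illustrative opens/closes fields (not the benchmark's actual schema):

```python
from datetime import time

def within_hours(visit_start: time, visit_end: time, opens: time, closes: time) -> bool:
    """Local rule: the planned visit must fit entirely inside business hours."""
    return opens <= visit_start and visit_end <= closes

def filter_open_attractions(candidates, visit_start: time, visit_end: time):
    """Keep only candidates whose opening hours contain the planned time slot."""
    return [c for c in candidates
            if within_hours(visit_start, visit_end, c["opens"], c["closes"])]
```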
🥬 Filling (Concept 8: Global Constrained Optimization)
- What it is: Make the whole plan optimal under big limits like total budget and time.
- How it works:
- Sum costs/times across days/items.
- Compare to overall limits.
- Trade off options (slightly pricier item to unlock a better coupon; closer attraction to save travel time).
- Pick the combination that best satisfies everything.
- Why it matters: A plan that looks fine step-by-step can still be impossible overall.
🍞 Bottom Bread (Anchor): Buying a more expensive jersey if it lets you use a big coupon that makes the total cart cheaper.
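The jersey-and-coupon trade-off can be made concrete with a small brute-force sketch. The single threshold coupon and the field names are assumptions used only to illustrate whole-plan optimization, not the benchmark's coupon schema:

```python
from itertools import product

def best_cart(options_per_item, coupon_value, coupon_threshold, budget):
    """Pick one option per item so the discounted total is lowest
    while the whole cart stays under the overall budget."""
    best_combo, best_total = None, float("inf")
    for combo in product(*options_per_item):
        subtotal = sum(item["price"] for item in combo)
        discount = coupon_value if subtotal >= coupon_threshold else 0
        total = subtotal - discount
        if total <= budget and total < best_total:
            best_combo, best_total = combo, total
    return best_combo, best_total
```

Note how a slightly pricier item can win: if it pushes the subtotal over the coupon threshold, the discounted total for the whole cart drops.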
Why It Works (Intuition, no equations):
- Offline sandboxes freeze the world so tools return stable, fair data.
- Layered constraints create realistic friction points that force careful planning.
- Rule-based checkers act like referees: they don’t “feel,” they verify.
- Unique optimal solutions avoid ambiguity and reward correct reasoning paths.
Building Blocks:
- Travel domain: minute-level itineraries across transport, hotels, attractions, meals.
- Shopping domain: multi-product carts, sizes, shipping, and coupon stacking.
- Scoring: commonsense realism + personalized needs + perfect-case accuracy for end-to-end success.
03 Methodology
High-Level Flow: Input (user task) → Tool Calls (fetch facts) → Plan Draft (day-by-day or cart) → Self-Checks (local + global) → Final Output → Rule-Based Scoring
🥬 Filling (Concept 9: Offline Sandbox & Tools)
- What it is: A sealed playground with databases and Python tools that simulate real searches.
- How it works:
- The agent can only access data via provided tools.
- Tools return structured facts (times, prices, coordinates, stock, coupons).
- All agents see the same data, ensuring fairness and reproducibility.
- No outside guessing is allowed.
- Why it matters: Keeps experiments fair and results verifiable, not fuzzy.
🍞 Bottom Bread (Anchor): Like a science lab where everyone uses the same equipment and samples so results can be compared.
Recipe Steps (Travel Planning example):
- Step 1: Parse the user’s trip (cities, dates, people, rooms, budget, extra wishes).
- Step 2: Query flights/trains matching time windows and seat availability.
- Step 3: Pick hotels matching stars/services (e.g., washer + dryer) and dates.
- Step 4: Use search_location to get coordinates; use query_road_route_info to connect places with travel_city steps, recording distance/duration/cost.
- Step 5: Recommend attractions and restaurants via recommend_attractions and recommend_restaurants; verify opening hours and durations.
- Step 6: Build a minute-by-minute day plan with buffers (e.g., 30–45 min after flights for baggage).
- Step 7: Calculate itemized costs by the given rules (per person, per room/night, per car, etc.).
- Step 8: Check global constraints (no time overlaps, within budget, loop completeness) and fix issues.
- Step 9: Output the final itinerary and budget summary in the strict format.
- Without Step 4 or 8: you’d get teleporting days or over-budget plans.
- Example: If a flight arrives 10:50, add 10:50–11:30 buffer for baggage before the next ride.
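As a hedged illustration of Steps 6 and 8, the buffer and overlap checks might look like the sketch below. The 40-minute default mirrors the 10:50–11:30 example above, and the (start, end) event format is an assumption for illustration:

```python
from datetime import datetime, timedelta

def earliest_next_start(flight_arrival: datetime, buffer_minutes: int = 40) -> datetime:
    """Step 6: leave a baggage buffer (30-45 min) after a flight before the next ride."""
    return flight_arrival + timedelta(minutes=buffer_minutes)

def has_overlap(events):
    """Step 8: no two itinerary items may overlap.

    Each event is assumed to be a (start, end) pair of datetimes.
    """
    ordered = sorted(events)
    return any(prev_end > next_start
               for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))
```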
Recipe Steps (Shopping Planning example):
- Step 1: List exact product requirements (season, size, brand, ratings, sales).
- Step 2: Call search_products, then filter by brand/size/range until candidates fit local rules.
- Step 3: Retrieve details for shipping, stock, and ratings.
- Step 4: Explore combinations that satisfy all requests.
- Step 5: Apply coupon logic: same-brand vs cross-store scopes, thresholds, and stacking order.
- Step 6: Compute totals and pick the cheapest cart under budget; if none fits, pick the absolute cheapest and report the shortfall.
- Step 7: Return the final cart JSON with chosen coupons and price breakdown.
- Without Step 5: you’d often miss the best discount strategy.
- Example: Choosing a slightly pricier Brand A jacket to unlock a higher-value cross-store coupon, making the whole cart cheaper.
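Step 5's coupon logic might look like the following sketch. The coupon layout (scope, threshold, value) and the biggest-first stacking order are assumptions for illustration; the real sandbox's schema and stacking rules may differ.

```python
def apply_coupons(cart_items, coupons):
    """Apply threshold coupons to a cart and return the discounted total.

    Assumed shapes (illustrative only):
      cart_items: [{"name": ..., "price": 59.0, "store": "A"}, ...]
      coupons:    [{"scope": {"A", "B"}, "threshold": 100, "value": 20}, ...]
    """
    total = sum(item["price"] for item in cart_items)
    for coupon in sorted(coupons, key=lambda c: -c["value"]):   # biggest coupon first
        in_scope = sum(item["price"] for item in cart_items
                       if item["store"] in coupon["scope"])
        if in_scope >= coupon["threshold"]:                     # threshold met
            total -= coupon["value"]
    return total
```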
🥬 Filling (Concept 10: Layered Task Generation)
- What it is: A way to build challenging tasks by starting simple and adding constraints.
- How it works:
- Base Skeleton: define core (cities/dates) or item themes.
- Personalized Constraint Injection: add user wishes (e.g., washer/dryer hotel, specific item names or budgets).
- Environment Constraint Injection: add real-world frictions (closed attraction day, limited flight seats, coupon stacking quirks).
- Adjust data so there’s exactly one best solution.
- Why it matters: Guarantees solvable, realistic problems with a unique target so scoring is clear.
🍞 Bottom Bread (Anchor): Like building a math problem that has only one correct answer after adding the right clues.
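A hypothetical configuration sketch of the layering idea: a base skeleton, then personalized and environmental constraint injections. All field names and values here are illustrative, not the benchmark's actual task schema.

```python
task = {
    # Base Skeleton: the core trip definition.
    "skeleton": {"cities": ["City A", "City B"], "days": 5, "people": 2, "rooms": 1},
    # Personalized Constraint Injection: user wishes.
    "personalized": [
        {"type": "hotel_service", "value": "washer_and_dryer"},
        {"type": "total_budget", "value": 9000},
    ],
    # Environment Constraint Injection: real-world frictions baked into the data.
    "environment": [
        {"type": "attraction_closed", "target": "Museum X", "on_day": 3},
        {"type": "limited_seats", "flight": "XY123", "seats_left": 1},
    ],
}
```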
🥬 Filling (Concept 11: Rule-Based Scoring)
- What it is: Automatic code checks that verify plans without opinion.
- How it works (Travel):
- Commonsense Score: 8 dimensions (route consistency, sandbox compliance, itinerary structure, time feasibility, business hours, duration rationality, cost calculation accuracy, activity diversity). Each dimension is pass/fail for 1/8 point.
- Personalized Score: 1 if all user-specific requests are satisfied, else 0.
- Composite Score: average of commonsense and personalized.
- Case Accuracy: 1 only if everything is perfect; else 0.
- How it works (Shopping):
- Match Score: fraction of ground-truth items correctly selected.
- Case Accuracy: 1 only if the cart exactly matches ground truth.
- Why it matters: Encourages full, correct plans—not just partially right pieces.
🍞 Bottom Bread (Anchor): Like a checklist referee who awards points only if every box is truly checked.
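The travel rubric above translates almost directly into code. A minimal sketch, assuming each of the eight commonsense dimensions is reported as a pass/fail boolean:

```python
def travel_scores(dimension_passes, personalized_ok):
    """Compute the travel metrics described above.

    dimension_passes: list of 8 booleans, one per commonsense dimension.
    personalized_ok:  True only if every user-specific request is satisfied.
    """
    assert len(dimension_passes) == 8
    commonsense = sum(dimension_passes) / 8          # each dimension worth 1/8 point
    personalized = 1.0 if personalized_ok else 0.0
    composite = (commonsense + personalized) / 2     # average of the two scores
    case_accuracy = 1.0 if (commonsense == 1.0 and personalized == 1.0) else 0.0
    return {"commonsense": commonsense, "personalized": personalized,
            "composite": composite, "case_accuracy": case_accuracy}
```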
🥬 Filling (Concept 12: Secret Sauce)
- What it is: The clever parts that make the benchmark tough but fair.
- How it works:
- Unique optimal solutions remove ambiguity.
- Parallel tool use is allowed so smart agents can be efficient.
- Strict name matching prevents hidden hallucinations.
- Minute-level timing forces realistic travel connections.
- Why it matters: These push agents to plan like careful humans, not just guess.
🍞 Bottom Bread (Anchor): It’s like a spelling bee where exact letters matter—close isn’t correct.
04 Experiments & Results
🍞 Top Bread (Hook): Think of a class challenge: everyone gets the same map, tools, and budget, and must plan the best trip. Then the teacher uses a rubric to grade fairness, timing, and cost.
The Test: The team evaluated many top AI models on 120 travel tasks (Chinese + English variants) and 120 shopping tasks. They measured commonsense realism, meeting personal needs, exact end-to-end success (case accuracy), and in shopping, whether the chosen products matched the correct set.
The Competition: Leading families (GPT-5 series, Claude-4.5, Gemini-3, Qwen3, DeepSeek, GLM, Grok, Seed, Kimi) were tested in both non-reasoning and reasoning modes, with up to 400 tool calls per task and repeated runs for stability.
Scoreboard with Context:
- Travel: Even the best model achieved only around 35% Case Accuracy. That’s like getting an A on many parts yet still, most of the time, failing to turn in a perfectly correct final project.
- Shopping: Some models scored high on Match Score but still missed perfect Case Accuracy, meaning their carts were close but not exact.
- Reasoning Helps: Turning on deliberate reasoning (the model’s inner thoughts) consistently improved performance over non-reasoning modes.
- Domain Differences: One model (Gemini-3-Flash-Preview) was middling in travel but shone in shopping (about 60% Case Accuracy), showing that strengths vary by domain.
🥬 Filling (Concept 13: Cost–Performance Trade-off)
- What it is: Doing better often costs more tool calls and turns; smarter reasoning can shift this curve.
- How it works:
- More tool calls generally improve scores because the agent checks more facts.
- Reasoning modes get higher scores with fewer wasted steps.
- Parallel vs sequential styles change turns: bundling many calls per turn is efficient; step-by-step can be more thorough.
- Why it matters: Teams must balance speed and thoroughness.
🍞 Bottom Bread (Anchor): Like studying: more hours usually help, but smart study methods can get better grades with fewer hours.
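The parallel-versus-sequential difference is easiest to see in code. A minimal sketch, assuming the sandbox tools can be wrapped as async callables (the search name is hypothetical):

```python
import asyncio

async def bundled_turn(search, queries):
    """Parallel style: issue several independent lookups in one turn."""
    return await asyncio.gather(*(search(q) for q in queries))

async def sequential_turns(search, queries):
    """Sequential style: one lookup per turn; more turns, same information."""
    return [await search(q) for q in queries]
```

Both styles gather the same facts; bundling trims turns, while the step-by-step style leaves room to adjust the next query based on the last result.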
Concrete Findings:
- Some top models made ~224 tool calls per travel task to reach peak scores—heavy information gathering pays off.
- Enabling reasoning in Claude-4.5-Opus raised performance while cutting interaction turns and tool calls, meaning less trial-and-error.
- Within the GPT-5 family, a more sequential model outscored a more parallel one by ~12.7% but needed about 10× more turns—quality vs efficiency.
Impact of Task Complexity:
- Travel scores fell as itinerary length grew from 2 to 7 days—small slips compound across days and break the plan.
- Shopping accuracy dropped from Level 1 to Level 3 as more cross-item constraints and coupon timing turned the task into a global optimization puzzle.
Surprising Findings:
- Models often knew local rules but still failed globally (budget or timing collisions).
- High Match Score in shopping didn’t guarantee perfect carts—coupon and item combinations require precise, global reasoning.
- Internal reasoning reliably boosted both quality and efficiency frontiers.
05 Discussion & Limitations
🍞 Top Bread (Hook): Building a giant Lego castle is hard—not just placing each block, but making sure the whole structure stands strong.
Limitations:
- Domain Coverage: Only travel and shopping are included today; adding healthcare scheduling, events, or robotics would broaden realism.
- Synthetic Queries: User requests are constructed from layered constraints; true live queries may differ, causing distribution shift.
- Single-Turn Focus: Tasks are solved in one planning sweep; real users may chat back and forth (multi-turn), which isn’t covered yet.
Required Resources:
- Offline sandboxes, databases, and Python toolkits; enough compute for hundreds of tool calls; reproducible environments and parsers.
When NOT to Use:
- If you need open-web browsing or multimedia perception; if you need live, changing data; or if your research is about casual chit-chat, not strict planning.
Open Questions:
- How to teach agents to notice missing info early and fix it (active querying)?
- How to make agents robust to hidden constraints (like limited seats or closure days) without over-calling tools?
- How to perform dependable global backtracking—correcting the whole plan when one piece changes?
- How to support multi-turn user interactions where goals evolve over time?
🍞 Bottom Bread (Anchor): It’s like improving a team’s playbook: can players spot a missing defender (info), avoid fouls (local rules), and still win the game (global success) even when the opponent changes strategy (dynamic constraints)?
06 Conclusion & Future Work
Three-Sentence Summary:
- DeepPlanning is a realistic benchmark that tests whether AI agents can gather facts, obey local rules, and still meet whole-plan constraints across long horizons.
- Using travel and shopping sandboxes with code-based scoring, it reveals that even top models often fail to produce perfectly correct, end-to-end plans.
- Reasoning and smart tool use help, but robust global planning and backtracking remain open challenges.
Main Achievement:
- A rigorous, reproducible way to measure true planning ability—proactive information acquisition, local constrained reasoning, and global constrained optimization—under verifiable constraints with unique optimal solutions.
Future Directions:
- Expand domains (events, education timetables, healthcare), add multi-turn interactions, and develop stronger global-consistency checks and backtracking.
- Explore training/finetuning that encourages parallel tool use, explicit reasoning, and reliability under long horizons.
Why Remember This:
- It shifts the goalpost from “can the AI do a step?” to “can the AI deliver a whole, working plan?”—the difference between a clever move and a real win in life-like tasks.
Practical Applications
- Evaluate and compare travel-planner AIs that must produce minute-by-minute itineraries under a strict budget.
- Stress-test shopping assistants on complex carts with brand filters, sizes, shipping times, and coupon stacking for the lowest total.
- Train agents to proactively query tools (instead of guessing) by practicing on information-rich sandbox tasks.
- Benchmark reasoning modes to decide when to enable chain-of-thought for better cost–performance trade-offs.
- Develop global consistency checkers that catch timing overlaps, budget errors, or route discontinuities before finalizing plans.
- Prototype backtracking strategies that revise the whole plan when a single step fails (e.g., sold-out seats).
- Design curriculum-style tasks that gradually increase horizon length (days) and cross-item constraints to build robustness.
- Tune agents for efficient parallel tool use (bundling multiple calls per turn) where appropriate.
- Use the rule-based checker framework to create domain-specific rubrics (events, education, healthcare scheduling).
- Adopt unique-solution task generation to create clear ground truth for agent training and evaluation.