
HiMAP-Travel: Hierarchical Multi-Agent Planning for Long-Horizon Constrained Travel

Intermediate
The Viet Bui, Wenjun Li, Yong Liu · 3/5/2026
arXiv

Key Summary

  • HiMAP-Travel is a team-based AI planner that splits a long trip into daily chunks so it can follow tough rules like budgets without drifting off course.
  • It solves a big problem called constraint drift, where single long plans slowly forget the original rules as they get longer.
  • A high-level Coordinator spreads the trip’s resources across days, and Day Executors plan each day in parallel to keep the thinking short and focused.
  • A shared, synchronized global state acts like a referee that blocks overspending and duplicate bookings the moment they’re about to happen.
  • If a day looks impossible, a simple bargaining signal tells the Coordinator to reshape the plan and try again quickly.
  • One single model powers all roles by switching prompts (role conditioning) and is trained with GRPO so the team improves together.
  • On the TravelPlanner benchmark, HiMAP-Travel reaches 52.65% Final Pass Rate, beating strong baselines by 8–18 percentage points while running about 2.5× faster.
  • It also handles multi-turn changes on FlexTravelBench with 44.34% (2-turn) and 37.42% (3-turn) success, showing it can adapt mid-plan.
  • Ablations show each piece matters: removing the synchronized monitor, the coordinator, bargaining, or parallelism hurts results a lot.
  • Beyond vacations, the recipe can help with any long project that has shared limits—like coding big software, planning deliveries, or lab experiments.

Why This Research Matters

Big plans fall apart if they ignore shared limits like budgets, capacities, or “no duplicates.” HiMAP-Travel shows a practical way to keep plans both smart and safe by checking rules at the exact moment actions happen. That means fewer last-minute fixes, less wasted compute, and more trustworthy results. This recipe maps far beyond vacations: software projects can split modules with a shared build ledger, delivery routes can plan locally under a shared capacity cap, and labs can schedule experiments without overusing scarce equipment. Because a single model runs every role, small teams can deploy it more easily and get reliable performance without managing many separate models. In short, this turns long, messy jobs into organized teamwork with a live referee—faster, steadier, and more correct.


Detailed Explanation


01 Background & Problem Definition

The world before: You know how building a Lego city is easy when it has just a few blocks, but gets tricky when you try to add trains, roads, and rules like “no two buildings the same” and “don’t go over the brick budget”? Early AI planners were like kids who are great at small Lego sets. They could write short plans and use tools, but when the plan got long—like a 7-day vacation with flights, hotels, meals, and attractions—they started forgetting the big rules. That meant a single mistake on Day 1 (like overspending) could ruin the whole trip.

The problem: For long trips, planners must follow hard rules (a fixed total budget, no duplicate restaurants or attractions, travel mode consistency) and soft preferences (cuisines you like). Traditional AI planned everything in one long chain. As the plan grew, the earlier instructions got buried under new tool outputs, notes, and steps. The result was what the authors call constraint drift: the longer the chain, the more the model forgets about global rules, like the total budget.

Failed attempts: Many systems tried “write the whole plan, then check it.” If a rule was broken, they would refine the plan. But that wastes time: imagine writing all 7 days only to discover Day 1 broke the budget—now you redo huge chunks. As trips get longer, this verify-and-refine loop becomes slow and expensive, and it still doesn’t stop early mistakes from leaking into later days. Another path used many agents talking in long conversations to negotiate fixes, but that burns tokens and time, and still risks missing tight, coupled rules (like a shared global budget).

The gap: What was missing was a way to stop errors right when they’re about to happen, instead of trying to clean them up later. Planners needed two things at once: 1) shorter, easier chunks of thinking (so they don’t forget the rules), and 2) a strict referee that blocks a bad move before it gets locked in (so the rules stay true across all days).

The stakes in daily life: When you plan a real trip, a single over-priced hotel can break the budget for the rest of the week. If you double-book the same restaurant or mix forbidden transport modes, your itinerary fails. People want trip planners that are right the first time, not plans that look pretty but secretly break rules. That’s money, time, and trust.

Introducing the key idea: HiMAP-Travel turns the big plan into a team sport with a coach and players. The coach (Coordinator) sets the strategy—who does what each day and rough budgets. The players (Day Executors) each plan their own day in parallel. A shared referee (synchronized global state) watches every move and instantly says “no” to overspending or duplicates. If a player reports, “This day’s goal is impossible,” there’s a short, structured bargaining round: the coach adjusts the plan, and everyone tries again. This moves planning from “write it all, then fix” to “build it right as you go.”

Why this matters now: Long-horizon planning is the next step for AI agents—trips, projects, logistics, science experiments. As tasks get longer and rules get tighter, we need planners that don’t drift and don’t waste time. HiMAP-Travel shows how to keep both speed and correctness: parallelize the easy parts, and enforce the rules at the exact moment they matter.

02 Core Idea

The “Aha!” moment in one sentence: Split the job into a smart coach and focused day-players who work in parallel, and enforce the shared rules with a strict, real-time referee so the plan is correct by construction, not by repair.

Multiple analogies:
1) Sports team: The coach (Coordinator) gives each player (Executor) their role for the game (each day), players act in parallel on the field, and a referee (synchronized state) stops fouls (overspending, duplicates) instantly. If a play can’t work, the team calls a quick huddle (bargaining) and updates the strategy.
2) Kitchen brigade: The head chef decides the menu plan and splits tasks; each station cooks their course at the same time. A kitchen manager checks the pantry live so nobody uses the same rare spice twice or goes over cost. If a dish is impossible today, the head chef swaps it out fast.
3) School project: A project leader splits a big report into sections. Classmates write their parts at once, and a style/budget checker blocks mistakes like duplicate charts or over-page limits. If a section is too hard, the leader reassigns or reshapes it.

Before vs. after: Before, a single planner wrote a very long essay. Early budget choices got forgotten later, and mistakes appeared late, forcing long rewrites. After, many short essays get written at once, each with a live checker that blocks wrong moves. If someone can’t finish, the leader reshapes the task, not the whole project.

Why it works (intuition, no equations):
• Shorter thinking windows: Each day-agent sees only what it needs, so it doesn’t drown in notes from other days. Shorter context = fewer memory slips about global rules.
• Real-time rule enforcement: The synchronized state is like a traffic light—actions can’t “go through” if they break a rule. That prevents bad decisions from snowballing.
• Fast recovery: The bargaining step is a tiny, structured signal (“budget problem,” “timing problem”) that triggers quick reshaping, not a long debate.
• One brain, many hats: The same model plays coach and player by switching prompts, so skills learned in one role help the other (e.g., knowing flights are pricey helps with smarter allocations).

Building blocks (each explained with the Sandwich pattern):

1) Constraint Drift.
🍞 Hook: Imagine you start a long video game quest. At first, you remember your main mission. After hours of side quests, you forget the big goal.
🥬 The concept: Constraint drift is when a planner slowly stops paying attention to the original hard rules as its plan gets longer. How it works:
- The planner writes a long plan step by step.
- New tool outputs and notes pile up.
- The beginning rules get buried.
- The model focuses on nearby details and forgets the global budget or no-duplicates rule.
Why it matters: Without fixing drift, long plans look good but secretly break the rules—especially near the end.
🍞 Anchor: In a 5-day trip, a sequential planner stayed under budget on Day 1 but badly overspent by Day 5, because the original budget faded from focus.

2) Hierarchical Multi-Agent Framework.
🍞 Hook: You know how a movie set has a director and many crews working at once?
🥬 The concept: A hierarchical multi-agent framework splits planning into a top-level strategist (Coordinator) and lower-level doers (Day Executors). How it works:
- The Coordinator reads the request and sets day-level goals and rough budgets.
- Day Executors each plan one day independently and in parallel.
- Only key shared facts (like total spend and used venues) are coordinated centrally.
Why it matters: Without this split, one brain tries to do everything at once, gets overwhelmed, and drifts.
🍞 Anchor: While one Executor books Day 2’s meals, another books Day 3’s hotel, both safely staying under the shared budget set by the Coordinator.

3) Synchronized Global State.
🍞 Hook: Imagine a scoreboard that all players can see, and it won’t let you add points twice for the same goal.
🥬 The concept: The synchronized global state is a shared, locking “referee” that blocks overspending and duplicate bookings the instant they’re attempted. How it works:
- An Executor proposes an action (e.g., book a restaurant).
- The state checks: Will this exceed the total budget or duplicate a venue?
- If yes, it rejects the action right away; if no, it commits it so others see it.
Why it matters: Without this referee, parallel agents can easily overspend or double-book, creating conflicts that are hard to fix later.
🍞 Anchor: If two days try to book “Pasta Palace,” the second attempt gets blocked before it lands, so the plan stays valid.

4) Cooperative Bargaining Protocol.
🍞 Hook: Imagine two kids trading snacks at lunch. If one can’t give an apple, they quickly agree to swap a banana instead.
🥬 The concept: Bargaining is a tiny, structured back-and-forth that lets Executors say “this sub-goal is infeasible” so the Coordinator can reshape the plan. How it works:
- An Executor sends a short signal: “infeasible,” with a type (budget, time, availability).
- The Coordinator adjusts cities, roles, or routes.
- Everyone tries again, usually only once or twice.
Why it matters: Without bargaining, the team pushes forward with a bad plan or wastes time chatting in long paragraphs.
🍞 Anchor: If Day 2 can’t find a child-friendly hotel, it sends “availability infeasible,” and the Coordinator switches to a nearby city with more options.

5) Unified Role-Conditioned Policy.
🍞 Hook: Think of an actor who can play both the coach and the player just by changing costumes.
🥬 The concept: One single model runs both the Coordinator and the Executors by switching a short role prompt. How it works:
- Same neural network; different role instructions.
- Skills transfer between roles (e.g., tactical price sense informs strategic budget setting).
- Training is simpler and more memory-efficient.
Why it matters: Without a shared policy, separate models can’t easily share what they learn, and training costs rise.
🍞 Anchor: The model that learns “flights are expensive on Fridays” as an Executor also budgets extra for Fridays when it’s the Coordinator.

6) Group Relative Policy Optimization (GRPO).
🍞 Hook: Picture a soccer team reviewing several plays and learning from which ones scored best relative to the group.
🥬 The concept: GRPO is a way to train the whole team by comparing multiple attempts and nudging the policy toward what worked better than its peers. How it works:
- Generate several plan attempts for the same request.
- Score them with strict rule checks and preferences.
- Improve the model toward the higher-scoring ones, while staying near the base behavior.
Why it matters: Without a good team-training rule, the system might reward long, chatty plans instead of short, valid ones.
🍞 Anchor: If four rollouts try budget splits, the one that stays on budget and meets cuisine rules teaches the model the best next moves.
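The group-relative scoring at the heart of GRPO can be sketched in a few lines of Python. This is an illustrative toy, not the paper’s exact reward function: the reward values are invented, and the mean/standard-deviation normalization is one common way to compute a group-relative advantage. Rollouts that beat the group average get a positive advantage (reinforced); below-average rollouts get a negative one.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: score each rollout against its peers.

    Advantage = (reward - group mean) / group std, so the advantages
    always sum to zero across the group: above-average rollouts are
    pushed up, below-average ones are pushed down.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# Four hypothetical rollouts for the same trip request, scored by rule
# checks: a valid, on-budget plan scores high; overspending or duplicate
# venues score low. (Scores are made up for illustration.)
rewards = [1.0, 0.2, 0.6, 0.2]
advs = grpo_advantages(rewards)
best = max(range(len(rewards)), key=lambda i: advs[i])  # index of the strongest rollout
```

Because every rollout is judged only against its own group, there is no separate value network to train; the policy update simply weights each rollout’s tokens by its advantage, while a penalty term (as the paper describes) keeps the model near its base behavior.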

03 Methodology

At a high level: Input (user’s trip request) → Coordinator (strategic split) → Parallel Day Executors (tactical plans with live checks) → Bargaining if needed (fast reshape) → Final itinerary.

Step-by-step with what, why, and an example:

1) Read and structure the request (Coordinator).
• What happens: The Coordinator turns the user’s message into a structured plan skeleton: which cities to visit, which days are travel vs. stay, and rough per-day budget hints that add up under the total budget.
• Why this step exists: It sets the stage so day-level planning is guided, not random. Without it, Day Executors overspend early or choose mismatched cities.
• Example: “3 days, $1700, 1 person, visit Rockford” becomes: Day 1: fly to Rockford, Day 2: stay, Day 3: return; budget hints tilt spending toward flights and a cheap hotel.

2) Start parallel day planning (Executors).
• What happens: Each Day Executor plans its assigned day independently in a clean context window: transportation (if needed), accommodation, meals, and attractions, using the tool APIs.
• Why this step exists: Short, focused planning prevents long-context drift and speeds everything up. Without it, one long chain forgets global rules over time.
• Example: Day 1 books a flight and a low-cost hotel and skips optional meals to save money; Day 2 selects affordable meals and one attraction; Day 3 books the return.

3) Enforce rules live with the synchronized global state.
• What happens: Whenever an Executor tries to commit a choice (like booking a hotel), the synchronized state checks if it would overspend the total budget or duplicate a venue. If so, the action is rejected immediately; otherwise, it’s committed and visible to all.
• Why this step exists: It prevents conflicts that are expensive to fix later. Without it, parallel agents can overspend or double-book before anyone notices.
• Example: If Day 2 tries to reuse “Pasta Palace” already booked by Day 1, the commit is blocked instantly and the Executor picks a different restaurant.

4) Quick recovery when a sub-goal is infeasible (bargaining).
• What happens: If an Executor can’t find any valid plan (e.g., no child-friendly hotel in budget), it sends a tiny JSON signal with the problem type (budget/time/availability). The Coordinator then tweaks the city choice, route order, or day roles and relaunches planning.
• Why this step exists: It’s a fast, structured way to adapt without a long chat or full restart. Without it, the system either stalls or wastes huge compute.
• Example: After an “availability infeasible” signal for Day 2, the Coordinator switches to a nearby city with cheaper, kid-friendly hotels, and everyone replans once.

5) One model, two hats (role-conditioned policy).
• What happens: The same LLM is used for both Coordinator and Executors by giving it a different system prompt for each role. Training uses the same shared parameters.
• Why this step exists: It lowers memory/compute, stabilizes training, and lets strategic and tactical knowledge cross-pollinate. Without it, separate models can drift apart in behavior.
• Example: Tactics learned while booking cheap meals make the Coordinator more realistic when setting day budget hints.

6) Train the team with GRPO.
• What happens: For each request, the system samples several full or partial plans, scores them strictly (validity first), and updates the model toward better-than-average ones. A gentle penalty keeps behavior near the base model to avoid wild swings.
• Why this step exists: It makes the whole team improve at satisfying constraints, not just at sounding fluent. Without it, the model might favor verbose but invalid plans.
• Example: Among four rollouts, the one that meets the budget and covers requested cuisines trains the model more than the ones that overspend or repeat venues.

The secret sauce (what’s uniquely clever):
• Correct-by-construction: By checking rules at commit time, not at the end, the system blocks errors before they spread.
• Parallelism with safety: Agents plan days at the same time, but the synchronized state prevents them from stepping on each other’s toes.
• Tiny, typed bargaining: Instead of long debates, a short “infeasible, budget” signal triggers a quick, targeted reshape.
• One brain, many roles: A single model learns both strategy and tactics, improving sample efficiency and stability.

Concrete walk-through example:
• Input: “Plan a 3-day trip from St. Petersburg to Rockford for 1 person with $1700.”
• Coordinator: Chooses Rockford, labels Day 1 as travel, Day 2 as stay, Day 3 as return. Sets soft budget hints (spend more on flights and hotel, save on meals).
• Executors in parallel: Day 1 finds a $474 flight and a $210/night private room; skips meals to save money. Day 2 picks low-cost meals and one attraction. Day 3 finds a reasonable return option.
• Synchronized checks: When Day 2 attempts to reuse a restaurant, it’s blocked, so it picks a different spot, staying under budget.
• Bargaining (if needed): If Day 2 can’t find any valid hotel within budget, it sends “availability infeasible,” and the Coordinator toggles to a nearby city or rearranges days; then the team replans once more.
• Output: A complete itinerary that satisfies all hard rules without late-stage surprises.
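The commit-time check and the typed infeasibility signal described above can be sketched as a small Python class. This is a minimal sketch under assumed names (`GlobalState`, `try_commit`, and the venue names are invented for illustration); the paper’s state also tracks more shared facts, such as transport-mode consistency, and its infeasible signals feed the Coordinator’s bargaining step.

```python
import threading

class GlobalState:
    """Shared 'referee': atomically checks and commits bookings.

    Enforces two hard rules at commit time: the shared budget cap and
    no duplicate venues across days. Rejections return a tiny, typed
    signal that a Coordinator could use for bargaining.
    """

    def __init__(self, budget):
        self._lock = threading.Lock()
        self.budget_left = budget
        self.booked_venues = set()

    def try_commit(self, venue, cost):
        # Check-and-commit happens under one lock, so Day Executors
        # running in parallel can never race past the referee.
        with self._lock:
            if cost > self.budget_left:
                return {"ok": False, "signal": {"infeasible": True, "type": "budget"}}
            if venue in self.booked_venues:
                return {"ok": False, "signal": {"infeasible": True, "type": "availability"}}
            self.budget_left -= cost
            self.booked_venues.add(venue)
            return {"ok": True}

# Two Day Executors proposing actions against the shared state:
state = GlobalState(budget=1700)
day1 = state.try_commit("Pasta Palace", 60)       # first booking commits
day2 = state.try_commit("Pasta Palace", 55)       # duplicate venue: blocked
day2_retry = state.try_commit("Noodle Nook", 55)  # different venue commits
```

Because the check and the commit are atomic, a bad move is refused before it lands in the plan, which is exactly the “correct-by-construction” behavior the section describes: errors are blocked at the moment they are attempted, not repaired afterward.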

04 Experiments & Results

The test: The authors measured Final Pass Rate (FPR)—the percent of trips that satisfy all required rules at once—plus constraint-specific scores like budget adherence. They also measured Delivery Rate (does the agent produce a syntactically correct answer) and how long the system takes.

The competition: They compared to strong systems, including DeepTravel (a sequential RL agent), ATLAS (a verify-and-refine multi-agent system), and MTP (a hierarchical method). To be fair, they also ran a controlled matchup: HiMAP-Travel vs. DeepTravel with the same backbone model, same training method (GRPO), same tools, and same decoding.

The scoreboard with context:
• TravelPlanner (single-turn): HiMAP-Travel hits 52.65% test FPR with Qwen3-8B. That’s like getting a solid B+ while others hover around a C: +8.67 percentage points over DeepTravel (43.98%), +10.0 pp over MTP (42.68%), and +17.65 pp over ATLAS (35.00%). It also kept a perfect 100% Delivery Rate and showed much lower run-to-run variance, meaning it’s both accurate and steady.
• Constraint drift proof: In 5-day trips, the sequential baseline’s budget success slid from 98% on Day 1 to 42% on Day 5—classic drift. HiMAP-Travel stayed around 95% across days, showing that short, parallel planning with live checks really does guard the rules.
• FlexTravelBench (multi-turn): When users add new rules in later turns, HiMAP-Travel still performs well—44.34% for 2 turns and 37.42% for 3 turns—beating ATLAS by about 4–6 pp. The quick bargaining plus rollback helps it adapt without blowing up earlier good choices.
• Speed: It’s around 2.5× faster on 7-day trips (about 72 seconds vs. ~190 seconds) thanks to parallel day planning. Even with some extra tool calls from parallelism, the time savings and correctness wins make it worth it.

Surprising findings:
• The synchronized state almost eliminates duplicate-venue mistakes and slashes budget overflows. Ablations show that removing it drops FPR by roughly 9–12 pp and spikes duplicates.
• A single shared model for both roles works better than two separate models, likely because strategic and tactical skills transfer.
• Most plans succeed on the first try; when they don’t, 1–2 bargaining rounds usually fix things, meaning short, typed feedback is enough—no long arguments needed.
• Larger backbones help, but the hierarchical design gives big gains even to smaller models.

In short, across different tests and constraints, HiMAP-Travel is both more right and more efficient.

05 Discussion & Limitations

Limitations (be specific):
• The synchronized global state only enforces a few shared rules (budget, no duplicates, transport-mode consistency). It does not guarantee that every commonsense or timing rule is perfect—smart day planning is still needed, and those parts can fail.
• If tools or databases give noisy or incomplete info, Executors may spin their wheels until bargaining reshapes the task.
• Complex room rules (like exact room types or special house policies) remain challenging; results show these are still weak spots.
• Parallelism adds some extra tool traffic; while time is saved overall, token use may rise.

Required resources:
• One capable LLM (e.g., Qwen3-4B or 8B) that can follow role prompts and use tools.
• A small set of travel tools (search for flights, hotels, restaurants, attractions; cost lookups).
• An external synchronized store that can check/commit actions atomically (a simple service with a lock is enough).
• Training with GRPO benefits performance, but the architecture itself helps even without heavy training.

When not to use it:
• Tiny, 1-day or 2-day plans with few constraints might not need hierarchical parallelism—the overhead may not pay off.
• Domains where rules are vague or constantly shifting and can’t be checked at commit time; the referee can only enforce what’s formally encoded.
• Situations needing rich, human-like negotiation among agents; the protocol here is intentionally minimal and typed, not chatty.

Open questions:
• Can the synchronized state grow to cover more commonsense checks (like time windows) without becoming too heavy?
• How can we best choose the number of parallel Executors and scheduling to match real compute limits?
• Can we automatically learn better city/route decomposition policies from scratch, not just refine them with GRPO?
• How well does this recipe transfer to non-travel domains—like software builds, lab planning, or supply chains—where the objects and tools differ a lot?
• Could hybrid verification (light commit-time checks plus a final global sweep) catch the few remaining edge cases without adding too much latency?

06 Conclusion & Future Work

Three-sentence summary: HiMAP-Travel solves long, rule-heavy planning by splitting strategy (the Coordinator) from action (parallel Day Executors) and enforcing shared rules the moment actions are committed. A tiny bargaining loop lets the team quickly reshape bad sub-goals, and one shared model trained with GRPO learns both high-level and low-level skills. The result is higher accuracy, faster planning, and less variance than prior systems on tough travel benchmarks.

Main achievement: Turning “generate-then-fix” into “correct-by-construction” for long-horizon plans with hard constraints, thanks to a synchronized global state, structured bargaining, and a unified role-conditioned policy.

Future directions: Expand the live referee to more constraints (like schedule windows), refine automatic decomposition, and transfer the blueprint to software engineering, supply chains, and experimental design.

Why remember this: It shows how to keep big plans trustworthy and fast—short, parallel thinking plus live rule enforcement beats long monologues and late repairs.

Practical Applications

  • Personal trip planning assistants that reliably stay under budget and avoid duplicate venues across many days.
  • Corporate travel tools that coordinate multiple employees’ itineraries under shared caps (budget, hotel capacity).
  • Project management copilots that split large tasks across teams while enforcing shared resources and deadlines.
  • Supply chain planners that assign local delivery routes while honoring fleet capacity and depot stock limits.
  • Software build orchestrators that parallelize module work but block conflicting dependencies at commit time.
  • Scientific workflow planners that schedule experiments in parallel while preventing double-booked equipment.
  • Event planning systems that assign vendors and sessions across days without violating budget or overlap rules.
  • Education course planners that schedule classes and rooms in parallel while enforcing room and time constraints.
  • Healthcare appointment systems that coordinate many providers under shared limits (rooms, machines).
  • Retail promotion planners that schedule campaigns across regions while enforcing shared budget and stock rules.
#hierarchical planning #multi-agent systems #constraint drift #synchronized global state #bargaining protocol #role-conditioned policy #GRPO #travel planning benchmark #parallel execution #correct-by-construction #resource allocation #long-horizon planning #tool-augmented agents