HiMAP-Travel: Hierarchical Multi-Agent Planning for Long-Horizon Constrained Travel
Key Summary
- HiMAP-Travel is a team-based AI planner that splits a long trip into daily chunks so it can follow tough rules like budgets without drifting off course.
- It solves a big problem called constraint drift, where single long plans slowly forget the original rules as they get longer.
- A high-level Coordinator spreads the trip's resources across days, and Day Executors plan each day in parallel to keep the thinking short and focused.
- A shared, synchronized global state acts like a referee that blocks overspending and duplicate bookings the moment they're about to happen.
- If a day looks impossible, a simple bargaining signal tells the Coordinator to reshape the plan and try again quickly.
- One single model powers all roles by switching prompts (role conditioning) and is trained with GRPO so the team improves together.
- On the TravelPlanner benchmark, HiMAP-Travel reaches 52.65% Final Pass Rate, beating strong baselines by 8–18 percentage points while running about 2.5× faster.
- It also handles multi-turn changes on FlexTravelBench with 44.34% (2-turn) and 37.42% (3-turn) success, showing it can adapt mid-plan.
- Ablations show each piece matters: removing the synchronized monitor, the coordinator, bargaining, or parallelism hurts results a lot.
- Beyond vacations, the recipe can help with any long project that has shared limits, like coding big software, planning deliveries, or lab experiments.
Why This Research Matters
Big plans fall apart if they ignore shared limits like budgets, capacities, or "no duplicates." HiMAP-Travel shows a practical way to keep plans both smart and safe by checking rules at the exact moment actions happen. That means fewer last-minute fixes, less wasted compute, and more trustworthy results. This recipe maps far beyond vacations: software projects can split modules with a shared build ledger, delivery routes can plan locally under a shared capacity cap, and labs can schedule experiments without overusing scarce equipment. Because a single model runs every role, small teams can deploy it more easily and get reliable performance without managing many separate models. In short, this turns long, messy jobs into organized teamwork with a live referee: faster, steadier, and more correct.
Detailed Explanation
01 Background & Problem Definition
The world before: You know how building a Lego city is easy when it has just a few blocks, but gets tricky when you try to add trains, roads, and rules like "no two buildings the same" and "don't go over the brick budget"? Early AI planners were like kids who are great at small Lego sets. They could write short plans and use tools, but when the plan got long, like a 7-day vacation with flights, hotels, meals, and attractions, they started forgetting the big rules. That meant a single mistake on Day 1 (like overspending) could ruin the whole trip. The problem: For long trips, planners must follow hard rules (a fixed total budget, no duplicate restaurants or attractions, travel-mode consistency) and soft preferences (cuisines you like). Traditional AI planned everything in one long chain. As the plan grew, the earlier instructions got buried under new tool outputs, notes, and steps. The result was what the authors call constraint drift: the longer the chain, the more the model forgets about global rules, like the total budget. Failed attempts: Many systems tried "write the whole plan, then check it." If a rule was broken, they would refine the plan. But that wastes time: imagine writing all 7 days only to discover Day 1 broke the budget, forcing you to redo huge chunks. As trips get longer, this verify-and-refine loop becomes slow and expensive, and it still doesn't stop early mistakes from leaking into later days. Another path used many agents talking in long conversations to negotiate fixes, but that burns tokens and time, and still risks missing tight, coupled rules (like a shared global budget). The gap: What was missing was a way to stop errors right when they're about to happen, instead of trying to clean them up later. Planners needed two things at once: 1) shorter, easier-thinking chunks (so they don't forget the rules), and 2) a strict referee that blocks a bad move before it gets locked in (so the rules stay true across all days).
The stakes in daily life: When you plan a real trip, a single overpriced hotel can break the budget for the rest of the week. If you double-book the same restaurant or mix forbidden transport modes, your itinerary fails. People want trip planners that are right the first time, not plans that look pretty but secretly break rules. That's money, time, and trust. Introducing the key idea: HiMAP-Travel turns the big plan into a team sport with a coach and players. The coach (Coordinator) sets the strategy: who does what each day, and rough budgets. The players (Day Executors) each plan their own day in parallel. A shared referee (synchronized global state) watches every move and instantly says "no" to overspending or duplicates. If a player reports, "This day's goal is impossible," there's a short, structured bargaining round: the coach adjusts the plan, and everyone tries again. This moves planning from "write it all, then fix" to "build it right as you go." Why this matters now: Long-horizon planning is the next step for AI agents: trips, projects, logistics, science experiments. As tasks get longer and rules get tighter, we need planners that don't drift and don't waste time. HiMAP-Travel shows how to keep both speed and correctness: parallelize the easy parts, and enforce the rules at the exact moment they matter.
02 Core Idea
The "Aha!" moment in one sentence: Split the job into a smart coach and focused day-players who work in parallel, and enforce the shared rules with a strict, real-time referee so the plan is correct by construction, not by repair. Multiple analogies: 1) Sports team: The coach (Coordinator) gives each player (Executor) their role for the game (each day), players act in parallel on the field, and a referee (synchronized state) stops fouls (overspending, duplicates) instantly. If a play can't work, the team calls a quick huddle (bargaining) and updates the strategy. 2) Kitchen brigade: The head chef decides the menu plan and splits tasks; each station cooks its course at the same time. A kitchen manager checks the pantry live so nobody uses the same rare spice twice or goes over cost. If a dish is impossible today, the head chef swaps it out fast. 3) School project: A project leader splits a big report into sections. Classmates write their parts at once, and a style/budget checker blocks mistakes like duplicate charts or sections over the page limit. If a section is too hard, the leader reassigns or reshapes it. Before vs. after: Before, a single planner wrote a very long essay. Early budget choices got forgotten later, and mistakes appeared late, forcing long rewrites. After, many short essays get written at once, each with a live checker that blocks wrong moves. If someone can't finish, the leader reshapes the task, not the whole project. Why it works (intuition, no equations): • Shorter thinking windows: Each day-agent sees only what it needs, so it doesn't drown in notes from other days. Shorter context means fewer memory slips about global rules. • Real-time rule enforcement: The synchronized state is like a traffic light: actions can't "go through" if they break a rule. That prevents bad decisions from snowballing. • Fast recovery: The bargaining step is a tiny, structured signal ("budget problem," "timing problem") that triggers quick reshaping, not a long debate.
• One brain, many hats: The same model plays coach and player by switching prompts, so skills learned in one role help the other (e.g., knowing flights are pricey helps with smarter allocations). Building blocks (each explained with the Sandwich pattern): 1) Constraint Drift. Hook: Imagine you start a long video game quest. At first, you remember your main mission. After hours of side quests, you forget the big goal. The concept: Constraint drift is when a planner slowly stops paying attention to the original hard rules as its plan gets longer. How it works: - The planner writes a long plan step by step. - New tool outputs and notes pile up. - The beginning rules get buried. - The model focuses on nearby details and forgets the global budget or no-duplicates rule. Why it matters: Without fixing drift, long plans look good but secretly break the rules, especially near the end. Anchor: In a 5-day trip, a sequential planner stayed under budget on Day 1 but badly overspent by Day 5, because the original budget faded from focus. 2) Hierarchical Multi-Agent Framework. Hook: You know how a movie set has a director and many crews working at once? The concept: A hierarchical multi-agent framework splits planning into a top-level strategist (Coordinator) and lower-level doers (Day Executors). How it works: - The Coordinator reads the request and sets day-level goals and rough budgets. - Day Executors each plan one day independently and in parallel. - Only key shared facts (like total spend and used venues) are coordinated centrally. Why it matters: Without this split, one brain tries to do everything at once, gets overwhelmed, and drifts. Anchor: While one Executor books Day 2's meals, another books Day 3's hotel, both safely staying under the shared budget set by the Coordinator. 3) Synchronized Global State. Hook: Imagine a scoreboard that all players can see, and it won't let you add points twice for the same goal.
The concept: The synchronized global state is a shared, locking "referee" that blocks overspending and duplicate bookings the instant they're attempted. How it works: - An Executor proposes an action (e.g., book a restaurant). - The state checks: Will this exceed the total budget or duplicate a venue? - If yes, it rejects the action right away; if no, it commits it so others see it. Why it matters: Without this referee, parallel agents can easily overspend or double-book, creating conflicts that are hard to fix later. Anchor: If two days try to book "Pasta Palace," the second attempt gets blocked before it lands, so the plan stays valid. 4) Cooperative Bargaining Protocol. Hook: Imagine two kids trading snacks at lunch. If one can't give an apple, they quickly agree to swap a banana instead. The concept: Bargaining is a tiny, structured back-and-forth that lets Executors say "this sub-goal is infeasible" so the Coordinator can reshape the plan. How it works: - An Executor sends a short signal: "infeasible," with a type (budget, time, availability). - The Coordinator adjusts cities, roles, or routes. - Everyone tries again, usually only once or twice. Why it matters: Without bargaining, the team pushes forward with a bad plan or wastes time chatting in long paragraphs. Anchor: If Day 2 can't find a child-friendly hotel, it sends "availability infeasible," and the Coordinator switches to a nearby city with more options. 5) Unified Role-Conditioned Policy. Hook: Think of an actor who can play both the coach and the player just by changing costumes. The concept: One single model runs both the Coordinator and the Executors by switching a short role prompt. How it works: - Same neural network; different role instructions. - Skills transfer between roles (e.g., tactical price sense informs strategic budget setting). - Training is simpler and more memory-efficient.
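The "check, then commit, atomically" behavior of the synchronized global state can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's implementation: the class name `GlobalState`, the method `try_commit`, and the use of a single in-process lock are all illustrative stand-ins for whatever atomic store the authors use.

```python
import threading

class GlobalState:
    """Illustrative shared 'referee': blocks budget overruns and duplicates."""

    def __init__(self, total_budget):
        self.lock = threading.Lock()   # one lock makes check+commit atomic
        self.total_budget = total_budget
        self.spent = 0.0
        self.booked_venues = set()

    def try_commit(self, venue, cost):
        """Atomically check both shared rules; commit only if both pass."""
        with self.lock:
            if venue in self.booked_venues:
                return False, "duplicate venue"
            if self.spent + cost > self.total_budget:
                return False, "budget exceeded"
            # Both checks passed: commit so all parallel Executors see it.
            self.booked_venues.add(venue)
            self.spent += cost
            return True, "committed"

state = GlobalState(total_budget=1700)
ok, msg = state.try_commit("Pasta Palace", 40)    # first booking commits
ok2, msg2 = state.try_commit("Pasta Palace", 35)  # duplicate is rejected
```

Because the check and the commit happen under the same lock, two Day Executors racing for the same venue cannot both succeed, which is exactly the "blocked before it lands" behavior described above.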
Why it matters: Without a shared policy, separate models can't easily share what they learn, and training costs rise. Anchor: The model that learns "flights are expensive on Fridays" as an Executor also budgets extra for Fridays when it's the Coordinator. 6) Group Relative Policy Optimization (GRPO). Hook: Picture a soccer team reviewing several plays and learning from which ones scored best relative to the group. The concept: GRPO is a way to train the whole team by comparing multiple attempts and nudging the policy toward what worked better than its peers. How it works: - Generate several plan attempts for the same request. - Score them with strict rule checks and preferences. - Improve the model toward the higher-scoring ones, while staying near the base behavior. Why it matters: Without a good team-training rule, the system might reward long, chatty plans instead of short, valid ones. Anchor: If four rollouts try budget splits, the one that stays on budget and meets cuisine rules teaches the model the best next moves.
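The "better than its peers" idea in GRPO boils down to a group-relative advantage: each rollout's reward is normalized against the mean and spread of its peer group. The sketch below shows only that scoring step (the standard mean/std normalization), with made-up reward values; the actual policy update and rewards in the paper are more involved.

```python
import statistics

def group_relative_advantages(rewards):
    """Score each rollout relative to its peer group (GRPO-style)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Four plan rollouts for the same request; only the first stays fully
# on budget and meets all cuisine rules, so it scores highest.
rewards = [1.0, 0.2, 0.2, 0.2]
advs = group_relative_advantages(rewards)
# The on-budget rollout gets a positive advantage; the others negative,
# so the policy is nudged toward the peer-relative winner.
```

Note the anchor story above plays out here: the rule-satisfying rollout is rewarded not in absolute terms but relative to its three peers, which keeps the training signal meaningful even when all rollouts are mediocre or all are good.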
03 Methodology
At a high level: Input (user's trip request) → Coordinator (strategic split) → Parallel Day Executors (tactical plans with live checks) → Bargaining if needed (fast reshape) → Final itinerary. Step-by-step with what, why, and an example: 1) Read and structure the request (Coordinator). • What happens: The Coordinator turns the user's message into a structured plan skeleton: which cities to visit, which days are travel vs. stay, and rough per-day budget hints that add up under the total budget. • Why this step exists: It sets the stage so day-level planning is guided, not random. Without it, Day Executors overspend early or choose mismatched cities. • Example: "3 days, 1700." • Coordinator: Chooses Rockford, labels Day 1 as travel, Day 2 as stay, Day 3 as return. Sets soft budget hints (spend more on flights and hotel, save on meals). • Executors in parallel: Day 1 finds a 210/night private room and skips meals to save money. Day 2 picks low-cost meals and one attraction. Day 3 finds a reasonable return option. • Synchronized checks: When Day 2 attempts to reuse a restaurant, it's blocked, so it picks a different spot, staying under budget. • Bargaining (if needed): If Day 2 can't find any valid hotel within budget, it sends "availability infeasible," and the Coordinator toggles to a nearby city or rearranges days; then the team replans once more. • Output: A complete itinerary that satisfies all hard rules without late-stage surprises.
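The pipeline above can be written as a short control loop. This is an orchestration sketch only: `coordinate` and `plan_day` are hypothetical stand-ins for the role-conditioned LLM calls, the dictionary shapes are invented for illustration, and the real system's bargaining signals are richer than a single flag.

```python
from concurrent.futures import ThreadPoolExecutor

def plan_trip(request, coordinate, plan_day, max_rounds=2):
    """Coordinator -> parallel Day Executors -> bargaining, as a loop."""
    skeleton = coordinate(request)  # day-level goals plus rough budget hints
    for _ in range(max_rounds):
        # Day Executors plan all days in parallel against the shared state.
        with ThreadPoolExecutor() as pool:
            days = list(pool.map(plan_day, skeleton["days"]))
        # Collect typed bargaining signals: which sub-goals failed, and why.
        signals = [d for d in days if d.get("infeasible")]
        if not signals:
            return days  # every day committed; no late-stage surprises
        # Coordinator reshapes only the failing sub-goals, then the team retries.
        skeleton = coordinate(request, feedback=signals)
    return None  # no feasible plan within the bargaining budget
```

The key design choice this mirrors is that a failed day does not trigger a full rewrite: only the skeleton is reshaped, and the bounded `max_rounds` matches the paper's observation that one or two bargaining rounds usually suffice.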
04 Experiments & Results
The test: The authors measured Final Pass Rate (FPR), the percent of trips that satisfy all required rules at once, plus constraint-specific scores like budget adherence. They also measured Delivery Rate (does the agent produce a syntactically correct answer) and how long the system takes. The competition: They compared to strong systems, including DeepTravel (a sequential RL agent), ATLAS (a verify-and-refine multi-agent system), and MTP (a hierarchical method). To be fair, they also ran a controlled matchup: HiMAP-Travel vs. DeepTravel with the same backbone model, same training method (GRPO), same tools, and same decoding. The scoreboard with context: • TravelPlanner (single-turn): HiMAP-Travel hits 52.65% test FPR with Qwen3-8B. That's like getting a solid B+ while others hover around a C: +8.67 percentage points over DeepTravel (43.98%), +10.0 pp over MTP (42.68%), and +17.65 pp over ATLAS (35.00%). It also kept a perfect 100% Delivery Rate and showed much lower run-to-run variance, meaning it's both accurate and steady. • Constraint drift proof: In 5-day trips, the sequential baseline's budget success slid from 98% on Day 1 to 42% on Day 5, classic drift. HiMAP-Travel stayed around 95% across days, showing that short, parallel planning with live checks really does guard the rules. • FlexTravelBench (multi-turn): When users add new rules in later turns, HiMAP-Travel still performs well: 44.34% for 2 turns and 37.42% for 3 turns, beating ATLAS by about 4–6 pp. The quick bargaining plus rollback helps it adapt without blowing up earlier good choices. • Speed: It's roughly 2.5× faster on 7-day trips (about 72 seconds vs. ~190 seconds) thanks to parallel day planning. Even with some extra tool calls from parallelism, the time savings and correctness wins make it worth it. Surprising findings: • The synchronized state almost eliminates duplicate-venue mistakes and slashes budget overflows. Ablations show that removing it drops FPR by roughly 9–12 pp and spikes duplicates.
• A single shared model for both roles works better than two separate models, likely because strategic and tactical skills transfer. • Most plans succeed on the first try; when they don't, 1–2 bargaining rounds usually fix things, meaning short, typed feedback is enough and no long arguments are needed. • Larger backbones help, but the hierarchical design gives big gains even to smaller models. In short, across different tests and constraints, HiMAP-Travel is both more right and more efficient.
05 Discussion & Limitations
Limitations (be specific): • The synchronized global state only enforces a few shared rules (budget, no duplicates, transport-mode consistency). It does not guarantee that every commonsense or timing rule is perfect; smart day planning is still needed, and those parts can fail. • If tools or databases give noisy or incomplete info, Executors may spin their wheels until bargaining reshapes the task. • Complex room rules (like exact room types or special house policies) remain challenging; results show these are still weak spots. • Parallelism adds some extra tool traffic; while time is saved overall, token use may rise. Required resources: • One capable LLM (e.g., Qwen3-4B or 8B) that can follow role prompts and use tools. • A small set of travel tools (search for flights, hotels, restaurants, attractions; cost lookups). • An external synchronized store that can check/commit actions atomically (a simple service with a lock is enough). • Training with GRPO benefits performance, but the architecture itself helps even without heavy training. When not to use it: • Tiny, 1-day or 2-day plans with few constraints might not need hierarchical parallelism; the overhead may not pay off. • Domains where rules are vague or constantly shifting and can't be checked at commit time; the referee can only enforce what's formally encoded. • Situations needing rich, human-like negotiation among agents; the protocol here is intentionally minimal and typed, not chatty. Open questions: • Can the synchronized state grow to cover more commonsense checks (like time windows) without becoming too heavy? • How can we best choose the number of parallel Executors and scheduling to match real compute limits? • Can we automatically learn better city/route decomposition policies from scratch, not just refine them with GRPO? • How well does this recipe transfer to non-travel domains, like software builds, lab planning, or supply chains, where the objects and tools differ a lot? • Could hybrid verification (light commit-time checks plus a final global sweep) catch the few remaining edge cases without adding too much latency?
06 Conclusion & Future Work
Three-sentence summary: HiMAP-Travel solves long, rule-heavy planning by splitting strategy (the Coordinator) from action (parallel Day Executors) and enforcing shared rules the moment actions are committed. A tiny bargaining loop lets the team quickly reshape bad sub-goals, and one shared model trained with GRPO learns both high-level and low-level skills. The result is higher accuracy, faster planning, and less variance than prior systems on tough travel benchmarks. Main achievement: Turning "generate-then-fix" into "correct-by-construction" for long-horizon plans with hard constraints, thanks to a synchronized global state, structured bargaining, and a unified role-conditioned policy. Future directions: Expand the live referee to more constraints (like schedule windows), refine automatic decomposition, and transfer the blueprint to software engineering, supply chains, and experimental design. Why remember this: It shows how to keep big plans trustworthy and fast; short, parallel thinking plus live rule enforcement beats long monologues and late repairs.
Practical Applications
- Personal trip planning assistants that reliably stay under budget and avoid duplicate venues across many days.
- Corporate travel tools that coordinate multiple employees' itineraries under shared caps (budget, hotel capacity).
- Project management copilots that split large tasks across teams while enforcing shared resources and deadlines.
- Supply chain planners that assign local delivery routes while honoring fleet capacity and depot stock limits.
- Software build orchestrators that parallelize module work but block conflicting dependencies at commit time.
- Scientific workflow planners that schedule experiments in parallel while preventing double-booked equipment.
- Event planning systems that assign vendors and sessions across days without violating budget or overlap rules.
- Education course planners that schedule classes and rooms in parallel while enforcing room and time constraints.
- Healthcare appointment systems that coordinate many providers under shared limits (rooms, machines).
- Retail promotion planners that schedule campaigns across regions while enforcing shared budget and stock rules.