
TourPlanner: A Competitive Consensus Framework with Constraint-Gated Reinforcement Learning for Travel Planning

Intermediate
Yinuo Wang, Mining Tan, Wenxiang Jiao et al. · 1/8/2026
arXiv · PDF

Key Summary

  • TourPlanner is a travel-planning system that first gathers the right places, then lets multiple expert ‘voices’ debate plans, and finally polishes the winner with a learning method that follows rules before style.
  • It solves three big problems: too many places to consider, only one reasoning path, and the difficulty of balancing strict rules (hard constraints) with nice-to-have preferences (soft constraints).
  • Its PReSO step filters and clusters places so the plan is compact in space and fits the traveler’s tastes.
  • Its CCoT step runs several specialized agents in parallel, then uses fair scoring and consensus to combine the best parts into one daily plan.
  • Its Constraint-Gated Reinforcement Learning uses a sigmoid gate so the system learns to pass hard rules first, then optimize comfort, budget, and personalization.
  • On the TripTailor benchmark, TourPlanner achieves 100% feasibility across models, greatly improves spatial efficiency, and raises the overall win rate over strong baselines.
  • It reduces the average route distance ratio from up to 5.98 to 2.15, meaning far less crisscrossing and wasted travel.
  • The recall step (PReSO) finds more ground-truth items than a prior workflow, boosting data quality for planning.
  • Ablations show the consensus debate (CCoT) and the gated reward are both essential; removing them clearly hurts performance.
  • This approach is model-agnostic and robust, suggesting the framework design—not just a single large model—is what drives the gains.

Why This Research Matters

Trip planning is stressful, and bad plans waste time and money. TourPlanner shows that AI can plan realistic, compact routes that match personal tastes while respecting must-follow rules. This helps families, schools, and tour companies build smoother days with fewer surprises. Because it works across different language models, it’s practical to deploy in many apps and services. Its “rules first, polish second” lesson applies to other planning tasks like deliveries, events, and education. In short, it turns big, messy choices into trusted, easy-to-follow daily plans.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): Imagine you're packing for a school trip with your class. There are so many places to visit, meals to eat, and rules to follow that it’s easy to get overwhelmed.

🥬 Filling (The Actual Concept – Why this research?):

  • What it is: Travel planning by AI means turning a person’s wish list into a day-by-day trip that is realistic, efficient, and fun.
  • How it worked before: Many systems stuffed tons of places into the model, picked a single plan, and tried to satisfy all rules and preferences at once.
  • Why that’s a problem: Too many places overflow the AI’s memory; one plan misses better options; and mixing strict rules (like opening hours) with softer wishes (like cuisine and vibe) often breaks something.

🍞 Bottom Bread (Anchor): Like trying to plan the whole class trip in one go, with everyone shouting ideas—you end up missing museum hours or walking too far between stops.

New Concept 1 — Hard vs. Soft Constraints 🍞 Hook: You know how your school says “No running in hallways” (must-do) but your teacher says “Try to sit with new friends” (nice-to-have)? 🥬 The Concept:

  • What it is: Hard constraints are strict rules you cannot break (e.g., opening hours, not repeating the same place); soft constraints are preferences you try to optimize (e.g., short routes, tasty meals, budget balance).
  • How it works: First check if the plan obeys the must-do rules; then improve comfort, variety, and personalization.
  • Why it matters: If you ignore hard rules, the trip can’t happen at all. 🍞 Anchor: Visiting a museum after it closes breaks a hard rule; choosing a slightly cheaper or tastier lunch tweaks a soft preference.
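To make the split concrete, here is a minimal Python sketch (not from the paper): hard rules are checked as pass/fail before soft preferences are scored at all. The itinerary fields (`poi`, `start`, `open`, `leg_km`, `cost`) and both function names are illustrative assumptions.

```python
# Minimal sketch: hard rules are pass/fail, soft preferences are a score.
# Field names, values, and thresholds are illustrative, not from the paper.
from datetime import time

def passes_hard_constraints(itinerary):
    """Hard rules: no repeated POI, and every visit fits its opening hours."""
    seen = set()
    for visit in itinerary:
        if visit["poi"] in seen:
            return False                          # duplicate place -> infeasible
        seen.add(visit["poi"])
        if not (visit["open"] <= visit["start"] and visit["end"] <= visit["close"]):
            return False                          # outside opening hours -> infeasible
    return True

def soft_preference_score(itinerary, budget):
    """Soft preferences: shorter walking legs and spending close to budget."""
    total_km = sum(v["leg_km"] for v in itinerary)
    total_cost = sum(v["cost"] for v in itinerary)
    return -total_km - abs(total_cost - budget) / 100.0

day = [
    {"poi": "Museum", "start": time(9), "end": time(11),
     "open": time(8, 30), "close": time(17), "leg_km": 1.2, "cost": 60},
    {"poi": "City Wall", "start": time(13), "end": time(15),
     "open": time(8), "close": time(22), "leg_km": 2.0, "cost": 54},
]
if passes_hard_constraints(day):
    print("feasible; soft score:", soft_preference_score(day, budget=150))
```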

New Concept 2 — Reinforcement Learning (RL) 🍞 Hook: Think of a video game where you learn from points: do the right moves, your score goes up. 🥬 The Concept:

  • What it is: RL teaches an AI to make better choices by rewarding good plans and penalizing bad ones.
  • How it works: The AI tries a plan, gets a score, and updates how it plans next time to get higher scores.
  • Why it matters: It helps the AI discover strategies that follow rules and improve quality over time. 🍞 Anchor: If an itinerary avoids closed attractions (points!) and shortens walking distance (more points!), the AI keeps doing more of that.

The World Before:

  • LLM travel planners had three recurring pains: too many candidate points of interest (POIs), only one reasoning path, and juggling hard vs. soft constraints.
  • Benchmarks like TravelPlanner and TripTailor showed LLMs could talk well but stumbled on grounding, constraints, and spatial logic.

The Problem:

  1. Pruning the millions of POIs down without losing the important ones.
  2. Exploring only one plan path, so the model misses better alternatives.
  3. Optimizing both hard rules and soft preferences at once, which often caused trade-offs to fail.

Failed Attempts:

  • Stuffing more context into the model → memory limits and noisy choices.
  • Relying on a single chain-of-thought plan → less exploration and brittle results.
  • Simply adding hard and soft rewards in RL → the model “chases comfort” before it truly follows rules.

The Gap:

  • A pipeline was needed that: (a) recalls the right POIs and arranges them in space, (b) debates multiple candidate plans, and (c) learns with a curriculum that respects hard rules first.

Real Stakes:

  • People waste time and money on unrealistic or zig-zaggy routes.
  • Families need reliable schedules; businesses need repeatable planning.
  • Better AI planning helps tourists, event organizers, schools, and tour companies deliver smoother, happier trips.

02 Core Idea

The “Aha!” in one sentence: Let multiple expert planners compete and agree on a plan built from the right places, then train the planner to value strict rules before polishing preferences.

Multiple Analogies:

  1. Team Debate: Several classmates (history fan, foodie, budget-keeper) each propose a day plan, critique each other, and a teacher merges the best parts.
  2. Cooking Show: Chefs cook different versions of the same dish; judges score them; the final recipe keeps the tastiest pieces while following the kitchen’s safety rules.
  3. Sports Tryouts: Many players try for a team; coaches score them; the final roster blends speed, defense, and strategy.

Before vs. After:

  • Before: One long plan, often bloated with places, ignoring distance, and easily breaking rules.
  • After: Filter first to compact, on-theme choices; explore many specialized plans; pick the best parts; then use learning to first pass rules and later optimize comfort and style.

Why It Works (intuition without equations):

  • Good inputs matter: Filtering and clustering make the candidate list smaller, smarter, and closer together on the map.
  • Many minds beat one: Multiple specialized agents surface trade-offs early (culture vs. food vs. budget vs. distance).
  • Rules-first learning: A gentle “gate” tells the learner to care about soft preferences only after hard rules are satisfied, preventing silly mistakes like visiting closed places.

Building Blocks (each with a Sandwich):

New Concept 3 — PReSO (Personalized Recall and Spatial Optimization) 🍞 Hook: Imagine choosing toys from a giant store, but you only bring home what fits your backpack and matches your tastes. 🥬 The Concept:

  • What it is: A three-step workflow that extracts your preferences, recalls the right POIs, and groups them by location.
  • How it works: (1) Build a user profile (explicit + inferred). (2) Recall via semantic match, top landmarks, and LLM suggestions. (3) Cluster POIs so days are compact.
  • Why it matters: Better inputs mean shorter routes and less confusion. 🍞 Anchor: If you love museums and noodles, PReSO surfaces nearby museums and noodle spots instead of random faraway places.

New Concept 4 — CCoT (Competitive Consensus Chain-of-Thought) 🍞 Hook: You know how a group project is better when each teammate has a role and they vote on the best ideas? 🥬 The Concept:

  • What it is: Many persona-agents propose day plans, review each other, then a referee fuses the top picks into one.
  • How it works: (1) Instantiate agents (e.g., culture, food, budget). (2) Generate plans in parallel. (3) Score for diversity and quality; fuse winners.
  • Why it matters: It balances multiple goals and avoids tunnel vision. 🍞 Anchor: The foodie pushes tastier lunches, the budgeter keeps costs steady, and the historian ensures museum time fits the hours.

New Concept 5 — Constraint-Gated Reinforcement Learning 🍞 Hook: Think of it like a school where you must pass safety training before you get to decorate the classroom. 🥬 The Concept:

  • What it is: An RL stage with a sigmoid “gate” that turns on soft-preference rewards only after hard rules are passed.
  • How it works: If hard-rule score is low, the gate is near zero (focus on rules). Once the score is high enough, the gate opens and preferences count more.
  • Why it matters: Prevents the model from chasing comfort while breaking basic rules. 🍞 Anchor: First ensure museum times and no duplicates; then improve cuisine variety and walking distance.

New Concept 6 — GSPO (Group Sequence Policy Optimization) 🍞 Hook: Imagine judging several essays at once, then giving feedback that compares them fairly. 🥬 The Concept:

  • What it is: A way to optimize the AI’s whole-plan choices using groups of plan samples per prompt.
  • How it works: Generate multiple itineraries; compare their rewards; update the policy to favor the better ones while keeping training stable.
  • Why it matters: It learns sequence-level planning rather than just word-by-word tricks. 🍞 Anchor: If Version B of a day plan obeys rules and has a nicer route than Version A, GSPO pushes the model to write more like B.

03 Methodology

At a high level: User Query → PReSO (Profile + Recall + Clustering) → CCoT (Agents propose + Review + Consensus) → Constraint-Gated RL (Refine) → Final Itinerary.

Step 1: PReSO (Personalized Recall and Spatial Optimization)

  • What happens: The system reads your request (e.g., “4 days, culture + food, medium budget”), extracts explicit needs (dates, budget, cities), and infers implicit tastes (hotel class, meal price range). Then it recalls candidate POIs three ways—semantic match to your query and keywords, famous landmarks (e.g., 4A+ attractions), and LLM-suggested matches. Next, it clusters POIs geographically (DBSCAN) so days can be planned in tight areas.
  • Why this step exists: Without pruning and clustering, the plan zig-zags across town and overwhelms the model’s memory.
  • Example: If you’re visiting Xi’an for 4 days, PReSO might recall Shaanxi History Museum, City Wall, nearby noodle restaurants, and hotels near those clusters—rather than mixing in far suburban spots.

New Concept 7 — User Profile Construction 🍞 Hook: You know how a tailor measures you before sewing a jacket? 🥬 The Concept:

  • What it is: Extract explicit info (dates, cities, budget) and infer hidden preferences (hotel class, meal range) from the query and city stats.
  • How it works: The model reads your text and city price data to guess a good hotel tier and meal budget.
  • Why it matters: The rest of the pipeline depends on knowing your style and limits. 🍞 Anchor: “Budget ¥4000, 4 days” → likely Midscale hotel, meals around a reasonable per-day range.
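The inference step can be pictured as a small rule of thumb. The sketch below is an assumption-laden illustration: the tier prices, the "leave half the budget for everything else" rule, and the `build_profile` name are all invented for the example, not the paper's actual heuristic.

```python
# Illustrative sketch: infer a hotel tier and per-meal budget from an explicit
# total budget plus made-up city price statistics (in ¥ per night).
CITY_AVG_NIGHTLY = {"Economy": 250, "Midscale": 450, "Upscale": 900}   # assumed stats

def build_profile(days, total_budget):
    nights = days - 1
    # Toy rule: pick the highest tier whose hotel cost still leaves roughly half
    # the budget for meals, tickets, and transport.
    affordable = [t for t, p in CITY_AVG_NIGHTLY.items() if p * nights <= total_budget * 0.5]
    tier = max(affordable, key=CITY_AVG_NIGHTLY.get, default="Economy")
    meal_budget = (total_budget - CITY_AVG_NIGHTLY[tier] * nights) / (days * 3)
    return {"days": days, "hotel_tier": tier, "per_meal_budget": round(meal_budget, 1)}

print(build_profile(days=4, total_budget=4000))
# -> {'days': 4, 'hotel_tier': 'Midscale', 'per_meal_budget': 220.8}
```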

New Concept 8 — Multi-Dimension POI Recall 🍞 Hook: Picking a playlist using search, charts, and friend suggestions. 🥬 The Concept:

  • What it is: Three channels: semantic similarity, top-rated landmarks, and LLM supplementation.
  • How it works: (1) Embedding search for relevance, (2) include famous anchors, (3) LLM adds good fits you might miss.
  • Why it matters: Increases the chance you don’t miss must-see places. 🍞 Anchor: It won’t forget the top museum, but can still add a lesser-known park that matches your vibe.
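A rough sketch of how the three channels could be merged, assuming POI descriptions have already been embedded. The `llm_suggested_names` argument stands in for an actual LLM call, and the 4.5-rating cutoff is an invented proxy for "top landmark"; neither is the paper's exact recipe.

```python
# Sketch of three-channel recall: semantic similarity + famous anchors + LLM
# supplementation, merged and deduplicated by name.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def recall_pois(query_vec, pois, llm_suggested_names, top_k=20):
    # Channel 1: POIs whose description embedding is closest to the query.
    semantic = sorted(pois, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)[:top_k]
    # Channel 2: always keep high-rated landmark anchors.
    landmarks = [p for p in pois if p["rating"] >= 4.5]
    # Channel 3: extra matches proposed by an LLM (here just a set of names).
    suggested = [p for p in pois if p["name"] in llm_suggested_names]
    merged = {p["name"]: p for p in semantic + landmarks + suggested}   # dedupe
    return list(merged.values())
```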

New Concept 9 — Spatial Clustering (DBSCAN) 🍞 Hook: Sorting puzzle pieces by color before assembling. 🥬 The Concept:

  • What it is: Group nearby POIs so days are compact.
  • How it works: Density-based clustering forms location groups; cluster labels are attached to POIs, restaurants, and hotels.
  • Why it matters: Minimizes back-and-forth trips. 🍞 Anchor: Museum + city wall + noodles in one cluster; mountain + hot spring in another.
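A minimal scikit-learn sketch of the clustering idea. The coordinates, `eps`, and `min_samples` values are illustrative; in practice, projecting to meters (or using a haversine metric) is more robust than clustering raw degrees.

```python
# Sketch: group POIs, restaurants, and hotels by location with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

coords = np.array([          # (lat, lon) pairs; values are illustrative
    [34.2650, 108.9540],     # museum
    [34.2660, 108.9450],     # noodle restaurant nearby
    [34.2580, 108.9430],     # hotel in the same area
    [34.3230, 108.7060],     # far suburban attraction
])

labels = DBSCAN(eps=0.02, min_samples=2).fit_predict(coords)
print(labels)                # [0 0 0 -1]: one compact downtown cluster, one outlier
```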

Step 2: CCoT (Competitive Consensus Chain-of-Thought)

  • What happens: The system creates 4–6 persona-agents (e.g., culture, foodie, budget, route-efficiency). A general skeleton for the day is drafted, then each agent refines it to fit its goal. All proposals pass basic rule checks. Next, proposals are compared for diversity (unique angles get more weight). Each agent scores and critiques others. Finally, top-k proposals are fused into one daily plan while respecting geographic order.
  • Why this step exists: One path can’t see all trade-offs. Competition and review balance goals.
  • Example: The foodie suggests a top local spot near the museum; the budgeter suggests a cheaper but close option; the arbiter picks the best combo that stays within time windows.

New Concept 10 — Agent Instantiation 🍞 Hook: Assigning roles in a school play. 🥬 The Concept:

  • What it is: Define each agent’s identity, measurable objective, and ranked priorities.
  • How it works: Culture agent maximizes museum hours; foodie maximizes cuisine quality within price; budgeter minimizes total spend.
  • Why it matters: Clear roles create clear, comparable plans. 🍞 Anchor: For “culture + gourmet + limited budget,” you might get Historian, Food Blogger, and Budget Manager agents.

New Concept 11 — Parallel Proposal Generation 🍞 Hook: Everyone writes their version of the essay at the same time. 🥬 The Concept:

  • What it is: From a base route skeleton, each agent independently crafts a day plan optimized for its goal.
  • How it works: They use only allowed POIs, obey time windows, and produce a complete day schedule.
  • Why it matters: Yields diverse, strong options. 🍞 Anchor: One plan spends longer at the museum; another shortens it to add a special lunch.

New Concept 12 — Proposal Diversity Weighting 🍞 Hook: Rewarding kids who bring new ideas to group work. 🥬 The Concept:

  • What it is: Give higher weight to proposals that are less similar to others.
  • How it works: Compute similarities; unique plans get bigger multipliers.
  • Why it matters: Prevents the final plan from being a bland average. 🍞 Anchor: A creative but feasible route gets noticed and preserved.
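One simple way to picture the weighting, not necessarily the paper's exact formula: embed each proposal, measure its average similarity to the others, and give less-similar proposals more weight.

```python
# Sketch: diversity weights from average pairwise cosine similarity.
# The 1 - avg_similarity rule is an illustrative choice.
import numpy as np

def diversity_weights(proposal_vecs):
    V = np.asarray(proposal_vecs, dtype=float)
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-9)
    sim = V @ V.T                                   # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)
    avg_sim = sim.sum(axis=1) / (len(V) - 1)        # how "average" each plan is
    weights = 1.0 - avg_sim                         # more unique -> higher weight
    return weights / weights.sum()

# Two near-duplicate proposals and one distinct one: the distinct third plan
# receives the largest weight.
print(diversity_weights([[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0]]))
```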

New Concept 13 — Parallel Peer Review 🍞 Hook: Classmates grade each other’s drafts. 🥬 The Concept:

  • What it is: Each agent scores others based on its objective and flags feasibility issues.
  • How it works: Produces a score matrix and short critiques.
  • Why it matters: Surfaces trade-offs and errors before merging. 🍞 Anchor: Foodie flags a great meal that’s too far; route agent suggests a closer spot.

New Concept 14 — Weighted Consensus Selection 🍞 Hook: A fair referee picks the best mix. 🥬 The Concept:

  • What it is: Combine diversity weights with peer scores to rank proposals; fuse top-k into one itinerary.
  • How it works: Keep geographic sequence, pick best POIs per slot, ensure timing and meals fit rules.
  • Why it matters: Get a balanced, expert-like daily plan. 🍞 Anchor: Morning from Plan A (best museum timing), lunch from Plan C (tastiest near-by), afternoon from Plan B (short walk).
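A toy sketch of the ranking step, assuming a peer-score matrix and diversity weights like the ones above. How the top-k winners are then fused slot by slot is left out, since that part depends on the itinerary structure.

```python
# Sketch: combine peer scores with diversity weights to rank proposals
# and keep the top-k for fusion. Numbers are illustrative.
import numpy as np

def rank_proposals(peer_scores, diversity_weights, top_k=2):
    """peer_scores[i][j] = score that reviewing agent i gives proposal j."""
    S = np.asarray(peer_scores, dtype=float)
    consensus = S.mean(axis=0) * np.asarray(diversity_weights)   # weighted consensus
    order = np.argsort(consensus)[::-1]                          # best first
    return order[:top_k], consensus

top, scores = rank_proposals(
    peer_scores=[[7, 9, 6], [8, 8, 5], [6, 9, 7]],   # 3 reviewers x 3 proposals
    diversity_weights=[0.30, 0.38, 0.32],
)
print(top, scores)   # the well-reviewed, reasonably distinct proposal tops the ranking
```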

Step 3: Constraint-Gated RL (Refinement)

  • What happens: The draft plan is fine-tuned with a reward that first emphasizes feasibility and rationality. A sigmoid gate increases the influence of soft preferences only after hard rules look good.
  • Why this step exists: A naive sum of rewards often tanks rule-following; the gate avoids that pitfall.
  • Example: Once opening hours and no-duplicates are consistently correct, it starts improving budget fit and route compactness.

New Concept 15 — Sigmoid Gate for Rewards 🍞 Hook: A dimmer switch that brightens only after you’ve safely wired the lamp. 🥬 The Concept:

  • What it is: A smooth function that scales soft-reward strength based on the hard-rule score.
  • How it works: Below threshold → near-zero weight on soft rewards; above → quickly ramps up.
  • Why it matters: Creates a curriculum: rules first, polish second. 🍞 Anchor: The plan won’t pick a dreamy dinner if it causes a late, illegal visit.
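Here is a minimal sketch of the gated-reward idea: the soft reward is scaled by a sigmoid of the hard-rule score, so it barely counts until the rules are nearly satisfied. The threshold and steepness values are illustrative, not the paper's hyperparameters.

```python
# Sketch: soft rewards pass through a sigmoid gate driven by the hard-rule score.
import math

def gated_reward(hard_score, soft_score, threshold=0.9, steepness=20.0):
    """hard_score, soft_score in [0, 1]; gate stays near 0 until hard rules pass."""
    gate = 1.0 / (1.0 + math.exp(-steepness * (hard_score - threshold)))
    return hard_score + gate * soft_score

print(gated_reward(hard_score=0.5, soft_score=0.8))   # gate ~0.0003: rules dominate
print(gated_reward(hard_score=0.95, soft_score=0.8))  # gate ~0.73: preferences now count
```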

New Concept 16 — GSPO Training 🍞 Hook: Comparing a set of attempts at once makes judging fairer. 🥬 The Concept:

  • What it is: Generate several itineraries per prompt, compute grouped advantages, and update the policy stably.
  • How it works: Sequence-level learning focuses on whole-plan quality, not token tricks.
  • Why it matters: Improves real planning behavior. 🍞 Anchor: If two of eight samples are excellent, the model shifts toward those styles.
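The group comparison can be pictured with a small sketch: several itineraries are sampled per prompt, and each one's advantage is its reward standardized against the group, so above-average plans push the policy toward their style. The sequence-level importance ratios and clipping of the full GSPO update are omitted here.

```python
# Sketch: group-relative advantages over a batch of sampled itineraries.
import numpy as np

def group_advantages(rewards):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)   # above-average plans get positive advantage

rewards = [0.42, 0.95, 0.38, 0.91, 0.40, 0.44, 0.39, 0.41]   # 8 sampled itineraries
print(group_advantages(rewards))   # the two strong samples receive positive advantages
```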

Secret Sauce:

  • Smart inputs (PReSO) + many specialized minds (CCoT) + a rules-first learning gate (RL) = realistic, compact, and personalized plans that generalize across different LLMs.

04 Experiments & Results

The Test (What and Why):

  • Environment: TripTailor sandbox (40 cities; static, rich data). This avoids the randomness of live web data and makes fair comparisons.
  • Metrics:
    1. Feasibility Pass Rate: Are there hallucinations or missing basics?
    2. Rationality Pass Rate: Do meals, durations, and hours make sense (Micro and Macro)?
    3. Average Route Distance Ratio: Is the route compact vs. real expert plans (lower is better)?
    4. Final Pass Rate: Must pass feasibility + rationality, and not exceed 1.5× route length of reference.
    5. Final Surpassing Rate: LLM-as-a-judge—does it match or beat human plans in personalization?
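As a rough stand-in for the route-efficiency metric, the sketch below compares a plan's total inter-stop distance with a reference (expert) plan using haversine distance; the benchmark's exact computation may differ, so treat this as an illustration of "lower ratio = less zig-zagging."

```python
# Sketch: average route distance ratio = plan route length / reference route length.
import math

def haversine_km(a, b):
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))          # Earth radius ~6371 km

def route_length_km(stops):
    return sum(haversine_km(stops[i], stops[i + 1]) for i in range(len(stops) - 1))

def route_distance_ratio(plan_stops, reference_stops):
    return route_length_km(plan_stops) / max(route_length_km(reference_stops), 1e-6)
```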

The Competition (Baselines):

  • Direct Planning: One-shot itinerary generation with an LLM.
  • ReAct Planning: Reason-then-act with tool use.
  • TripTailor Workflow: A strong structured pipeline.

The Scoreboard (With context):

  • 100% Feasibility across all tested backbones: Like scoring a perfect “no rule breaks” on every test.
  • Macro Rationality soars to 88%+ (often 90%+): Baselines struggled to pass all rationality rules together (many under 30%); TourPlanner clears them like moving from a C- to an A.
  • Route Efficiency: Average route distance ratio drops to about 2.15 from as high as 5.98 (Direct + GPT-4o baseline). That’s like cutting zig-zag walking by more than half.
  • Final Surpassing Rate: Up to 30.2%. This means the system frequently matches or beats the personalization quality of human itineraries.
  • PReSO Recall Gains: With GPT-4o, PReSO recall reaches 42.26% vs. 27.83% for TripTailor—finding far more of the true relevant items.

Surprising Findings:

  • Model-Agnostic Strength: Whether using GPT-4o or open-source models like Qwen and DeepSeek-R1, TourPlanner’s gains hold. This shows the framework design, not just one giant model, drives success.
  • Sweet Spot in Agent Count: 4–6 agents yield the best balance. Fewer reduces diversity; too many causes diminishing returns and slight regressions.
  • RL Matters—but Only with the Gate: Vanilla RL (just adding hard + soft rewards) underperforms. The gate is essential to keep rule-following strong while improving comfort and personalization.

Ablation Highlights:

  • Remove CCoT: Macro rationality drops and Final Pass Rate falls, proving the consensus debate is key.
  • Direct Refine without RL: Worse across metrics—learning is needed to generalize improvements.
  • Constraint-Gated RL vs. Vanilla RL: The gated version wins clearly on rule consistency and final outcomes.

Takeaway: The trio—PReSO inputs, CCoT consensus, and gated RL—works together like a well-coached team, each part covering the others’ weaknesses.

05 Discussion & Limitations

Limitations:

  • End-to-End RL is hard: The CCoT process plans day by day; rewards really depend on the whole trip. Designing a perfect process reward is tricky, so full-trip RL remains an open challenge.
  • Reward Model Scope: The preference model follows prior work. Deeper alignment with real user tastes could lift the surpassing rate further.

Required Resources:

  • Data: A structured city sandbox (like TripTailor) with transport, attractions, restaurants, and hotels.
  • Models: An LLM capable of long-context planning; optional open-source backbones are supported.
  • Compute: RL fine-tuning benefits from multi-GPU clusters; smaller-scale deployments can skip RL and still gain from PReSO + CCoT.

When NOT to Use:

  • Rapidly changing data: If opening hours or transport schedules shift hourly and you lack stable updates, the sandbox assumptions may mislead.
  • Ultra-short queries: If the user gives no preferences and a minimal budget/time window, the extra machinery may be overkill.
  • Real-time emergencies: For last-minute disruptions (storms, cancellations), a reactive tool+search agent may be more suitable than debate-style planning.

Open Questions:

  • Can we design trip-long rewards that teach pacing and variety across multiple days without micromanaging each hour?
  • How can user feedback loops (thumbs up/down on meals, museums, walking time) refine preference alignment mid-planning?
  • Could the agent team self-organize (add or retire roles) based on detected conflicts (e.g., a new ‘kid-friendly’ agent when children are present)?
  • Can we integrate formal solvers (for timing/transport) and still keep the creative, personalized feel?
  • How do we adapt to real-time data safely while preserving reproducible evaluations?

06 Conclusion & Future Work

Three-Sentence Summary: TourPlanner builds better trips by first picking the right, nearby places (PReSO), then letting multiple expert agents propose and debate daily plans (CCoT), and finally training with a rules-first reward gate (Constraint-Gated RL). This combination dramatically boosts feasibility, rationality, and route efficiency across different language models. The framework shows that planning quality is about smart inputs, diverse reasoning, and disciplined learning.

Main Achievement: It reliably balances hard rules and soft preferences by orchestrating multi-path consensus and a gated learning curriculum, achieving state-of-the-art results in a rigorous benchmark.

Future Directions:

  • Create whole-trip reward designs that teach multi-day pacing and diversity.
  • Personalize deeper with richer feedback and preference models.
  • Blend formal verification with creative LLM planning for both guarantees and charm.

Why Remember This: TourPlanner demonstrates a general recipe for complex planning: curate inputs, invite specialized debate, and train with rules first, polish second. That playbook can help beyond travel—anywhere we balance strict constraints with human comfort and taste.

Practical Applications

  • Personal travel apps that produce reliable, taste-matched daily itineraries with minimal backtracking.
  • Tour operators generating customized group tours that honor time windows and budgets.
  • City tourism boards offering themed routes (history, food, nature) optimized for walking distances.
  • School trip organizers ensuring safety rules (hard constraints) while enriching learning experiences.
  • Conference and event logistics planning (venues, meals, shuttles) with compact routing.
  • Accessible travel planning that respects specific time windows and proximity needs.
  • Last-minute re-planning when a venue closes, swapping in nearby alternatives automatically.
  • Multi-city backpacking routes that balance costs, transport schedules, and must-see stops.
  • Culinary-focused trips where the foodie agent weighs authenticity, price, and proximity.
  • Budget-sensitive family vacations that keep spending in check while maximizing fun.
#travel planning · #multi-agent reasoning · #chain-of-thought · #constraint satisfaction · #reinforcement learning · #sigmoid gating · #POI retrieval · #spatial clustering · #consensus arbitration · #route optimization · #TripTailor benchmark · #preference alignment · #GSPO · #LLM planning · #multi-objective optimization