LEO-RobotAgent: A General-purpose Robotic Agent for Language-driven Embodied Operator
Key Summary
- LEO-RobotAgent is a simple but powerful framework that lets a language model think, plan, and operate many kinds of robots using natural language.
- Instead of hard-coding separate programs for each task and robot, it gives the AI a toolbox it can pick from, like flying a drone, detecting objects, or moving a robot arm.
- The agent runs in a safe loop: think → choose a tool → act → observe → adjust, and it keeps a history so it learns what just happened.
- Humans can jump in at any time to correct, guide, or change the plan, making teamwork between people and robots easier and safer.
- Prompting tricks like Chain-of-Thought and one-shot examples make the robot's plans more reliable and faster, especially in tricky searches.
- In tests, the same framework worked on drones, robot arms, and wheeled robots, and even transferred from simulation to the real world with high success rates.
- Compared to more complicated multi-LLM agent designs, LEO-RobotAgent's streamlined single-LLM core was more robust, easier to debug, and used fewer tokens.
- A UAV object-search task succeeded 90% in simulation and 70% in the real world, with failures mostly due to control and perception inaccuracies, not the planner.
- The system is built on ROS with a web dashboard, so users can register tools, monitor runs, and interact with the agent visually without deep coding.
- The main limitation is that today's LLMs still struggle with 3D spatial common sense, so extra guidance is needed for precise real-world actions.
Why This Research Matters
Robots that can understand plain language and adapt on the fly will be far more helpful in homes, hospitals, warehouses, and cities. A single framework that works across drones, arms, and mobile bases cuts costs and speeds up deployment. Human-in-the-loop control makes these systems safer and easier to trust, because people can correct or redirect instantly. Clear prompting and modular tools turn complicated engineering into turn-the-crank setup, lowering the barrier for small teams and schools. Sim-to-real transfer means we can safely prototype in simulation and move to real hardware without starting over. By proving that a simple loop with one LLM can outperform heavier designs, this work points the way to practical, scalable robot assistants. Over time, better 3D reasoning will make these agents even more reliable in the messy real world.
Detailed Explanation
01 Background & Problem Definition
You know how kids learn chores at home one by one (first set the table, then fold laundry, then take out the trash) and each chore gets its own little rulebook? Traditional robots were like that. Each job had its own custom program, and adding a new job meant writing a whole new rulebook. As chores got more detailed, the rulebooks became bulky and fragile, and tiny changes could break everything.
Hook: Imagine you're the captain of a team, but every player can only do one move and needs a brand-new set of instructions for every game. The Concept (Large Language Models, LLMs): LLMs are computer programs that understand and generate human language.
- How it works: (1) You tell it a goal in plain words. (2) It reasons through steps. (3) It outputs actions or instructions. (4) It adjusts based on feedback.
- Why it matters: Without LLMs, we must write many rigid rules. With them, we replace piles of special-case logic with flexible reasoning. Anchor: Ask, "Find the trash bin and fly above it." An LLM can break that into search → detect → navigate → hover.
Before this research, people started plugging LLMs into robots for planning, but usually in one robot type and one task at a time. Some works made the LLM spit out code to run simple steps; others used complex multi-agent systems with many LLM roles. Those helped, but they were fragile, heavy, or hard to generalize across new robots and tasks. Debugging them felt like untangling a giant knot.
Hook: You know how two people learn to dance better when they talk clearly and can stop to help each other? The Concept (Human-Robot Interaction): This is how humans and robots communicate and work together.
- How it works: (1) Human gives a goal. (2) Robot explains its plan. (3) Human can correct or guide mid-task. (4) Robot updates and continues.
- Why it matters: Without good interaction, small misunderstandings turn into big mistakes. Anchor: "Pause. That's the wrong bin. Aim for the blue one." The robot changes course right away.
Hook: Think of a to-do list before cleaning your room. The Concept (Task Planning): Turning a goal into an ordered set of steps the robot can actually do.
- How it works: (1) Understand the goal. (2) Break it into steps. (3) Pick tools to do each step. (4) Execute and revise.
- Why it matters: Without planning, the robot might do steps in the wrong order or miss important checks. Anchor: "Find, fly, verify, hover, report" is a plan for "Go above the trash bin."
People tried three main directions, and each ran into a wall:
- Direct action lists from LLMs: fast, but no feedback loop if something goes wrong.
- LLM-generated programs: powerful, but they need careful review and can't do language-heavy steps well.
- Multi-LLM architectures: ambitious, but complex to coordinate, costly in tokens, and hard to debug.
The missing piece was a simple, general framework that lets one LLM think, plan, act, and adjust across many robots and tasks, while still welcoming human guidance and easy tool plug-ins.
Hook: Imagine a Swiss Army knife for robots. The Concept (General-Purpose Robotic Agent Framework): A system that lets robots handle many different jobs using one clear structure.
- How it works: (1) Take a language goal. (2) Reason a step. (3) Choose a tool. (4) Act. (5) Observe. (6) Repeat until done.
- Why it matters: Without it, each new job or robot needs custom code. Anchor: The same framework controls a drone, a robot arm, or a wheeled base to do different chores.
The stakes are real: safer drones that understand instructions, home robots that can help families, and factory robots that learn new tasks quickly without rewriting software. That's what LEO-RobotAgent aims to unlock: robust, flexible robot help in everyday life.
02 Core Idea
The "Aha!" in one sentence: Give one capable language model a clear loop and a plug-in toolbox so it can think, act, and adjust (on any robot, for many tasks) with humans able to guide at any time.
Three analogies for the same idea:
- Coach and playbook: The LLM is the coach calling plays each turn, picking the right tool (player) for the situation, then watching the field and updating the next play.
- Chef and kitchen: The LLM-chef reads the order, designs the recipe step-by-step, grabs the right tools (pan, oven), tastes (observation), and tweaks the seasoning (plan) until it's perfect.
- GPS with rerouting: You give a destination; the agent plots a route, drives, senses traffic, and reroutes as needed, while you can still tap "avoid highways" mid-trip.
Before vs. After:
- Before: Many special programs, one-per-task; fragile handoffs between planning and action; or heavy multi-LLM teams that were hard to coordinate.
- After: One streamlined loop: reason → choose tool → act → observe → reflect. The same structure works on drones, robot arms, and wheeled robots. Humans can interrupt and guide. Tools are modular and easy to register.
Why it works (intuition):
- Coherence: One mind (a single LLM) keeps the whole plan in its head, so fewer misunderstandings.
- Feedback: Observations after each action prevent the agent from marching off a cliff.
- Modularity: Tools turn complex abilities (fly, detect, grasp) into simple buttons the LLM can press with parameters.
- Human-in-the-loop: People can correct early, saving time and avoiding failures.
- Prompt scaffolding: Reasoning prompts (like Chain-of-Thought) and examples (one-shot) boost reliability.
Building blocks, each in the Sandwich format:
Hook: You know how giving a good recipe helps a friend cook better? The Concept (Prompt Engineering): Writing instructions that guide the LLM to reason clearly and act correctly.
- How it works: (1) State the role and rules. (2) Require structured output (JSON). (3) Add examples (one-shot). (4) Ask for step-by-step reasoning (CoT).
- Why it matters: Without good prompts, the LLM can be vague, skip steps, or call the wrong tool. Anchor: "First, explain your plan; then choose exactly one tool and give its inputs as JSON."
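To make this concrete, here is a minimal sketch of what such a system prompt could look like. The three-field JSON reply and the one-tool-per-step rule follow the paper's description; the exact wording and the tool list shown are illustrative assumptions, not the authors' verbatim prompt.

```python
# Illustrative system prompt; wording and tool list are assumptions.
# Only the three-field JSON schema follows the paper's description.
SYSTEM_PROMPT = """You are a robotic agent that operates real hardware.
Always reply with ONE JSON object with exactly three fields:
  "Message":      your step-by-step reasoning and plan (write this first),
  "Action":       the name of exactly one available tool,
  "Action Input": a JSON object with that tool's parameters.

Available tools:
  uav_fly(x, y, z, yaw)  -- fly the UAV to a pose, returns a status
  object_detection()     -- detect nearby objects, returns types and positions

One-shot example:
  {"Message": "I should scan the room before moving.",
   "Action": "object_detection", "Action Input": {}}
"""
```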
Hook: Think of a toolbox with labeled drawers: wrench, hammer, drill. The Concept (Toolset Module): The list of callable robot tools the LLM can use.
- How it works: (1) Each tool has a name, function, input/output description, and availability flag. (2) The agent picks a tool and passes parameters. (3) The tool returns an observation.
- Why it matters: Without a clean toolset, the LLM can't act in the world, only talk about it. Anchor: "uav_fly(x,y,z,yaw)" moves a drone; "object_detection()" returns detected items and positions.
Hook: Practice on a simulator like a flight game before flying a real drone. The Concept (Simulation-to-Real Transfer): Applying what works in simulation to the physical world.
- How it works: (1) Test the agent-loop in sim. (2) Use the same tools on the real robot. (3) Expect differences from sensors/control and adjust.
- Why it matters: Without this, you'd risk real hardware on untested plans. Anchor: A drone that found a bin in sim also found it in reality in 70% of trials, with misses due to control/localization noise.
Hook: A team where each player has a role: planner, doer, checker. The Concept (Multi-Agent Architecture): Splitting tasks among several LLMs with different jobs.
- How it works: (1) Planner proposes. (2) Actor executes. (3) Evaluator critiques. (4) They loop together.
- Why it matters: Can add structure, but also extra complexity, tokens, and miscommunication. Anchor: In tests, multi-LLM teams sometimes over-replanned or missed details; the single-LLM loop was steadier.
Put together, LEO-RobotAgent keeps the core simple (one LLM, tight loop, modular tools, human guidance) so it's easier to trust, scale, and reuse across robots and tasks.
03 Methodology
At a high level: Natural-language Task → LLM reasons a step (Message) → Chooses a Tool (Action + Action Input) → Tool executes → Observation → History update → Repeat or Finish.
Step-by-step recipe with the Sandwich pattern where new ideas appear:
- LLM configuration and structured outputs
- What happens: The system prompt sets strict rules so the LLM always replies in JSON with three fields: Message (its reasoning/plan), Action (tool name), and Action Input (parameters). It must write the Message first, then choose exactly one tool.
- Why this step exists: Without structure, the LLM might ramble, skip reasoning, or try multiple tools at once, causing chaos for robots.
- Example: {"Message": "Rotate 90° and detect.", "Action": "object_detection", "Action Input": {}}.
- Toolset module
- What happens: Each tool is registered with name, function, input/output schema, description, and on/off availability for the current task. Tools wrap robot capabilities like navigation, manipulation, perception, audio/speech, RAG, and even helper LLMs/VLMs.
- Why this step exists: Tools translate plans into real actions. Missing or poorly described tools lead to wrong calls or bad parameters.
- Example: uav_fly(x,y,z,yaw) → returns {"status":"arrived","pose":...}; object_detection() → returns {"objects":[{"type":"trash_bin","pos":[-0.24,-3.04,0.48]}]}.
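A registration sketch under the assumptions above: each entry carries the metadata the text lists (name, function, input/output schema, description, availability). The dataclass layout and the stub body are illustrative, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    func: Callable[..., dict]   # executes the capability, returns an observation
    description: str
    input_schema: dict
    available: bool = True      # on/off flag for the current task

def uav_fly(x: float, y: float, z: float, yaw: float) -> dict:
    # Stub standing in for the real flight-control call.
    return {"status": "arrived", "pose": [x, y, z, yaw]}

TOOLS = {
    "uav_fly": Tool("uav_fly", uav_fly,
                    "Fly the UAV to pose (x, y, z, yaw).",
                    {"x": "float", "y": "float", "z": "float", "yaw": "float"}),
}

print(TOOLS["uav_fly"].func(-0.24, -3.04, 1.5, 0))
# {'status': 'arrived', 'pose': [-0.24, -3.04, 1.5, 0]}
```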
- The closed-loop cycle and history
- What happens: The agent iterates: reason → act → observe. It stores the conversation, chosen tools, inputs, and tool observations as History. The LLM reads this to stay consistent.
- Why this step exists: Without a loop, it's open-loop guessing. Without history, it forgets what just happened or repeats steps.
- Example: After rotating and detecting, the observation lists a trash bin at (-0.24,-3.04). Next step: fly there and adjust yaw.
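Putting the pieces together, the loop itself can be sketched in a few lines. Here `llm` is assumed to be a callable that maps the history to an already-validated step dict, `tools` maps names to the Tool objects sketched above, and the terminal "finish" pseudo-tool is also an assumption.

```python
def run_agent(task: str, llm, tools: dict, max_steps: int = 20) -> list:
    """Illustrative think-act-observe loop with history; not the authors' code."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = llm(history)                      # reason: Message + tool choice
        history.append({"role": "assistant", "content": step})
        if step["Action"] == "finish":           # assumed terminal pseudo-tool
            break
        obs = tools[step["Action"]].func(**step["Action Input"])  # act
        history.append({"role": "tool", "content": obs})          # observe
    return history
```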
- Human-robot interaction
- What happens: Users can provide initial task details (goal, initial state, scenario), examples, and safety notes. During execution, they can pause to correct, redirect, or change the goal.
- Why this step exists: Real tasks change. Humans ensure safety and speed by correcting early.
- Example: "That's the wrong bin; choose the blue one," or "Add: drop the ball into the bin."
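One simple way to realize this mid-task guidance, sketched under the assumption that the UI pushes text into a thread-safe queue, is to fold pending human messages into the history before each reasoning step (the real system routes them through the Web UI described next):

```python
import queue

user_inbox = queue.Queue()   # filled by the UI thread; an illustrative assumption

def drain_user_messages(history: list) -> None:
    """Fold pending human corrections into the history before the next step."""
    while not user_inbox.empty():
        history.append({"role": "user", "content": user_inbox.get_nowait()})

# e.g. from the dashboard:
user_inbox.put("That's the wrong bin; choose the blue one.")
```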
- Application system (ROS + Web UI)
- What happens: The agent and tools are ROS nodes; messages flow over ROS topics (via RosBridge); video streams via a VideoServer; a Web UI shows chat, tool feedback, and controls for tool registration and node start/stop over WebSocket.
- Why this step exists: Robotics needs reliable, long-lived messaging and an easy ops console. Without it, setup and debugging are hard.
- Example: From the browser, a user registers a new perception tool, monitors detections, and restarts a node that crashed.
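A minimal sketch of wiring the agent into ROS 1 with rospy; the topic names here are assumptions, not the paper's actual interface.

```python
import rospy
from std_msgs.msg import String

def on_task(msg: String) -> None:
    rospy.loginfo("task received: %s", msg.data)
    # ...hand msg.data to the agent loop, then publish each step as JSON:
    step_pub.publish(String(data='{"Message": "...", "Action": "...", "Action Input": {}}'))

rospy.init_node("leo_robot_agent")
step_pub = rospy.Publisher("/agent/steps", String, queue_size=10)
rospy.Subscriber("/agent/task", String, on_task)
rospy.spin()   # RosBridge exposes these topics to the Web UI over WebSocket
```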
- Prompt engineering inside the loop
- What happens: The system prompt defines role, JSON format, tool descriptions, and examples (one-shot). Chain-of-Thought is encouraged to show reasoning before choosing actions.
- Why this step exists: CoT reduces sloppy thinking; examples show the style and safe patterns. Without them, plans get brittle.
- Example: Room search: with one-shot + CoT, the UAV explores thoroughly yet efficiently instead of staring at walls.
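Assembled as chat messages, the one-shot example and the CoT nudge might look like the sketch below; the contents are illustrative assumptions, only the technique names come from the paper.

```python
ONE_SHOT = [  # illustrative example exchange, not the paper's actual prompt
    {"role": "user", "content": "Search the room for objects."},
    {"role": "assistant", "content":
        '{"Message": "I will rotate in 90-degree increments and detect after '
        'each turn, so four detections cover the whole room.", '
        '"Action": "object_detection", "Action Input": {}}'},
]

COT_HINT = {"role": "system",
            "content": "Think step by step in the Message field before choosing a tool."}

def build_messages(system_prompt: str, task: str) -> list:
    return [{"role": "system", "content": system_prompt}, COT_HINT,
            *ONE_SHOT, {"role": "user", "content": task}]
```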
- Secret sauce: less is more
- The clever part is not adding more LLMs or complex supervisors, but giving one strong LLM a clean loop, good tools, visible history, and human help. This coherence lowers hallucinations, token use, and debugging cost, while keeping generality high across robots and tasks.
Concrete walk-through with actual data (UAV task):
- Input: "Search for nearby objects indoors, list them, then fly above the trash bin. After I confirm, drop the ball into it."
- Step 1: Message: "Rotate in 90° increments and run detection 4 times to cover the room." Action: object_detection(). Observation: person(-5.47,-1.42,0.60), trash_bin(-0.24,-3.04,0.48), ...
- Step 2: Message: "Navigate above the trash bin at 1.5 m altitude." Action: uav_fly(-0.24,-3.04,1.5,0). Observation: {"status":"arrived"}.
- Human: "Confirmed. Drop now."
- Step 3: Message: "Activate drop mechanism." Action: pwm_control({"channel":X,"value":...}). Observation: {"status":"released"}.
- Finish: Report success, location, and video snapshot.
Safety and failure handling:
- Require the agent to verbalize checks in Message (e.g., confirm target ID, estimate clearance, adjust yaw). If Observation conflicts, re-sense or choose a safer action. Humans can pause to correct.
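A guard of this kind could look like the following sketch; the status values and the optional "target" parameter are assumptions for illustration, not the authors' safety logic.

```python
def observation_ok(step: dict, observation: dict) -> bool:
    """Cross-check a tool observation against the agent's stated intent.
    On False, the loop should re-sense, pick a safer action, or ask the human."""
    if observation.get("status") in ("failed", "timeout"):
        return False
    # If the plan named a target, require it in the detection result:
    target = step["Action Input"].get("target")
    if step["Action"] == "object_detection" and target:
        found = {obj["type"] for obj in observation.get("objects", [])}
        return target in found
    return True
```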
Performance tuning tips:
- Register only needed tools (reduce confusion). Provide one-shot examples and short CoT. Enforce JSON schema strictly. Keep messages concise but clear. Use history summaries for long runs.
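The last tip, history summaries, can be sketched as a simple compaction pass; a production system might instead ask the LLM itself to write the summary. Everything here is an illustrative assumption.

```python
def compact_history(history: list, keep_last: int = 6) -> list:
    """Keep the task and the most recent steps verbatim; collapse
    older steps into one short summary line to save tokens."""
    if len(history) <= keep_last + 1:
        return history
    old, recent = history[1:-keep_last], history[-keep_last:]
    summary = {"role": "system",
               "content": "Summary of earlier steps: "
                          + "; ".join(str(m["content"])[:60] for m in old)}
    return [history[0], summary, *recent]
```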
04 Experiments & Results
What they measured and why:
- Can one simple agent framework plan and act across multiple robot types and tasks? Does human interaction help? Do prompts like CoT and one-shot improve planning? How does the streamlined agent compare to other popular agent architectures? They tracked success rates, task time, and token usage.
The tests:
- Feasibility on UAVs (simulation and real): A drone searched for a specified object via detection, then flew above it. In the real task, it also dropped an iron ball into a container. This checked the end-to-end loop, human interaction, and sim-to-real transfer.
- Prompt experiment: Two scenariosāroom-scale indoor search (find as many identifiable objects as possible) and city-scale search (find a pavilion using a VLM). Compared zero-shot, one-shot, CoT, and one-shot+CoT for speed, reliability, and tokens.
- Agent architecture comparison: On a simulated wheeled robot with a small arm, three tasks of rising difficulty: Delivery (pick-and-place three bottles to three targets, any order), Searching (find the nearest bottle using perception, pick it up, and return), and Handover (find person A, receive a natural-language subtask, fetch a bottle, place it near person B, return). Compared five schemes: DAS (direct sequence), CGE (code-generating), DLLMs (planner+evaluator), TLLMs (planner+actor+evaluator), and LEO (this paper's framework).
Competition and scoreboard (in friendly terms):
- UAV feasibility: 9/10 success in simulation (A-level), 7/10 in real flights (a solid B+), with real-world failures mostly from control accuracy and object localization, not the LLM's reasoning.
- Prompts: In both room and city tasks, one-shot and CoT each helped; together they did best overall. One-shot often finished fastest when successful (a good "floor"), while CoT cost more tokens and time but produced more thoughtful, thorough search patterns (a good "ceiling").
- Agent architectures: For the simple Delivery task, one-time generation (DAS, CGE) performed very well at low cost; it is great for clear, short jobs. For more complex, perception- and language-heavy tasks (Searching, Handover), LEO's streamlined single-LLM loop outscored or matched the others while keeping token and time costs reasonable. Multi-LLM teams (DLLMs, TLLMs) suffered from coordination overhead, hallucinations, and unnecessary replanning, hurting reliability and inflating costs.
Numbers with context:
- UAV feasibility: Simulation 90% vs. real 70% success; that's like making 9 correct dry runs in a video game and still getting 7 right in the real stadium, with misses due to bumpy turf (controls) rather than bad play calls (planning).
- Prompts (selected highlights): One-shot made successful runs brisk; CoT improved coverage but consumed more tokens/time; combined one-shot+CoT typically delivered the best planning quality, covering corners and avoiding blind spots.
- Agent comparison (simplified takeaways):
- Delivery: DAS/CGE: top scores with minimal tokens/time. LEO: top-tier too, at slightly higher cost, still strong and robust.
- Searching: LEO led with higher scores and better completion times than the others; TLLMs/CGE trailed; DAS couldn't run at all (no perception loop).
- Handover: Only DLLMs, TLLMs, and LEO could run (needs NLU). LEO achieved the highest score and much better reliability.
Surprises:
- Adding more LLMs didn't guarantee better results; it often introduced more chatter, confusion, and cost. The single-LLM loop's coherence paid off.
- One-shot examples sometimes outpaced deep CoT reasoning in speed while staying accurate; that is great when you have a good template plan.
- Sim-to-real gaps were dominated by tool/control precision, suggesting the agent logic transfers well if the hardware stack is solid.
05 Discussion & Limitations
Limitations:
- 3D spatial common sense is still hard for current LLMs. Without guidance, they may choose awkward viewpoints, forget to face targets, or misjudge distances and clearances. They also rely on perception tools that can mis-detect or localize poorly.
- Timing and low-level control aren't handled by the LLM; you still need reliable controllers, safety checks, and calibrated sensors.
- Very long-horizon tasks can bloat history; the agent needs summarization or memory tools to stay focused.
Required resources:
- A capable LLM or VLM endpoint; ROS-based robotics stack; registered tools for control and perception; and a web UI or similar ops surface. For real robots, add precise localization, safe controllers, and failsafes.
When not to use:
- Ultra time-critical, high-frequency control (e.g., millisecond-level stabilization) should stay in classical controllers, not through LLM loops.
- Environments with unreliable perception where the agent canāt meaningfully observe progress.
- Tasks that demand strict, certifiable determinism without room for language-driven variation.
Open questions:
- How to imbue LLMs with stronger spatial priors and 3D reasoning, especially under uncertainty?
- Best practices for automatic prompt adaptation during long tasks: can the agent learn to improve its own instructions?
- How to quantify and reduce hallucinations in tool selection and parameterization?
- How to blend classical planning (TAMP) and LLM reasoning in a principled way?
- Memory: What are robust, token-efficient strategies for long-horizon summaries and retrieval in the loop?
06 Conclusion & Future Work
In three sentences: LEO-RobotAgent is a streamlined, general-purpose framework that lets a single language model plan, act, observe, and adjust across many robots and tasks, while keeping humans in the loop. By enforcing structured outputs, providing a modular toolset, and running a tight think-act-observe cycle, it achieves robust performance that transfers from simulation to reality. Compared to heavier multi-LLM systems, it's easier to debug, uses fewer tokens, and proves more reliable on complex tasks requiring perception and natural language.
The main achievement: Showing that "less is more": one coherent LLM with a clear loop and good tools can outperform or match more complex agent teams, generalizing across UAVs, arms, and mobile bases while enabling smooth human-robot collaboration.
Future directions: Strengthen 3D spatial common sense and uncertainty handling in the loop; integrate memory/summarization for very long tasks; refine sim-to-real by improving control and perception tools; and explore principled hybrids of LLM reasoning with classical planners.
Why remember this: It's a practical recipe for turning language into reliable robot action across platforms. The framework proves you don't need a maze of agents to get generality; you need a clean loop, good tools, and room for people to help. That simplicity can speed real-world deployments, from homes and hospitals to farms, factories, and cities.
Practical Applications
- Voice-to-action home assistance: Ask a home robot to tidy a room, and it plans, picks, and places items safely with human corrections.
- Hospital delivery: Nurses describe supply runs; the robot plans routes, avoids crowds, and reports status, with staff able to redirect mid-run.
- Warehouse picking: Natural-language tasks ("Pick 3 bottles, slot B3") trigger precise grasp-and-place, with visual checks and retries.
- Construction site inspection: A drone surveys, detects hazards, and re-plans paths as workers guide it via short voice prompts.
- Campus security patrol: An autonomous UAV follows patrol patterns, queries a VLM for suspicious objects, and streams video to a dashboard.
- Farming scout: A rover inspects crops, detects pest signs, and flags areas for human review, adjusting routes on farmer feedback.
- Elder care reminders: A mobile assistant navigates, speaks reminders, fetches lightweight items, and asks clarifying questions when unsure.
- Disaster assessment: Teams describe goals; drones map areas, find landmarks, and relay findings while responders update priorities live.
- Classroom robotics labs: Students register simple tools and run language-driven tasks in sim, then transfer to hobby robots.
- Factory changeovers: Engineers describe new steps; the agent sequences motions, validates with sensors, and updates the plan without code rewrites.