
Agentic Reasoning for Large Language Models

Intermediate
Tianxin Wei, Ting-Wei Li, Zhining Liu et al. · 1/18/2026
arXiv · PDF

Key Summary

  • This paper explains how to turn large language models (LLMs) from quiet students that only answer questions into active agents that can plan, act, and learn over time.
  • Agentic reasoning is the big idea: connect thinking (plans) with doing (actions) and learning (feedback and memory).
  • There are three layers: foundational skills (planning, tool use, search), self-evolving skills (feedback and memory to improve), and teamwork skills (multi-agent collaboration).
  • Two ways to power these agents are in-context reasoning (smart prompting and workflows at test time) and post-training reasoning (training the model with fine-tuning or reinforcement learning).
  • Agents get better by reflecting on mistakes, storing helpful memories, and verifying results with tools or tests.
  • When many agents cooperate with clear roles and shared memory, they can solve bigger, messier tasks than one model alone.
  • Real-world uses include web research, math and coding help, science discovery, healthcare support, and robotics.
  • Benchmarks now test each skill separately (like tool use or memory) and also full applications (like web browsing or lab planning).
  • Open challenges include making agents personal and safe, handling very long tasks, learning world models, and training many agents together.
  • The main takeaway: reasoning should guide action, feedback, and collaboration so AI becomes more reliable and useful in changing environments.

Why This Research Matters

Agentic reasoning helps AI handle real-life tasks that change over time, like web research, coding with tests, and planning lab work. By planning, using tools, and checking results, agents become more trustworthy than simple one-shot chatbots. Memory lets them learn from experience, so they get faster and better with repeated tasks. Teams of agents can split complex jobs, debate, and verify, reducing mistakes. With clear validators and citations, agents can be more transparent and safer to use. As these systems mature, they can assist students, scientists, doctors, engineers, and everyday users more reliably. This shift turns AI from fluent talkers into careful doers.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you ask a smart friend one question and they answer right away. That works for simple facts, but what if the world changes, and they need to look things up, try tools, remember past attempts, or ask other friends for help?

🥬 The Concept: Large Language Models (LLMs) used to answer questions in a single shot, like taking a photo. They were great at math problems and code puzzles that don’t move or change, but they struggled in lively, real-world situations where facts update, mistakes happen, and tasks need many steps. How it works (before this paper): 1) You give a prompt and the model predicts the next words. 2) Tricks like Chain-of-Thought help it show steps, but it’s still a one-pass reply. 3) No tools, no memory, no trying-again built in. Why it matters: Without the ability to act, check, and learn, models can sound confident yet get things wrong when the task needs fresh info, multi-step planning, or real-world verification. 🍞 Anchor: Asking “What’s 12×39?” is easy for a static model. But planning a week-long science project with changing requirements isn’t.

🍞 Hook: You know how a kid learns to ride a bike? They try, wobble, fix what went wrong, and try again. Just reading a book about bikes isn’t enough—they must act and learn.

🥬 The Concept: Chain-of-Thought (CoT) made models show their steps, which helped in math/code benchmarks, but it mostly worked in closed worlds where nothing changes. How it works: 1) The model writes steps. 2) Those steps are inside the answer. 3) There’s no true action like opening a webpage or using a calculator. 4) If a step is wrong, it rarely checks using the real world. Why it matters: CoT boosts clarity but doesn’t connect to actions or learning from outcomes. 🍞 Anchor: It’s like writing a recipe without tasting the soup while cooking.

🍞 Hook: Think of a GPS that can’t recalculate when you miss a turn—it just keeps telling you the old route.

🥬 The Concept: The problem researchers faced is that standard LLMs can’t adapt during a task. They don’t plan over many steps, use outside tools, verify results, or remember useful lessons. How it works: 1) The world provides new info. 2) A static model can’t act to fetch it. 3) Mistakes don’t get corrected unless a human intervenes. 4) Past experience isn’t stored for tomorrow. Why it matters: In real life—web research, software fixes, lab procedures—success needs planning, action, checking, and memory. 🍞 Anchor: Booking a trip means comparing prices, checking dates, re-planning when flights change—things a single one-shot reply can’t juggle well.

🍞 Hook: Picture a robot assistant that can plan, grab tools, check its work, and remember what worked last time.

🥬 The Concept: Agentic reasoning is the shift from just talking to doing: models plan, act with tools, and learn through feedback and memory, sometimes working as a team of agents. How it works: 1) Plan steps. 2) Act with tools (search, code, APIs). 3) Check results (validators, tests). 4) Store helpful memories. 5) Improve next time through reflection or training. Why it matters: This connects thinking and doing, creating systems that handle open-ended, changing tasks. 🍞 Anchor: A homework helper agent can search verified sources, use a calculator, cite evidence, store study tips, and get better each week.

🍞 Hook: You know how you can either be super smart in the moment or train hard beforehand? Both help you play better.

🥬 The Concept: Two modes power agents: in-context reasoning (clever prompting and workflows at test time) and post-training reasoning (training the model with supervised fine-tuning or reinforcement learning). How it works: 1) In-context: keep weights frozen, use prompts, examples, and structured workflows to plan and act. 2) Post-training: update weights so skills like tool use and planning become built-in habits. Why it matters: In-context is fast to deploy and flexible; post-training makes agents more robust and consistent. 🍞 Anchor: In-context is like using a great playbook during the game; post-training is like months of practice changing your reflexes.
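
Here is a rough sketch of the difference as data, in Python (both the prompt template and the fine-tuning record below are invented examples, not taken from the paper): in-context reasoning changes what goes into the prompt at test time, while post-training changes the weights using collected traces.

```python
# In-context: the skill lives in the prompt assembled at test time (template is invented).
in_context_prompt = """You are a research agent. Plan, then act, then verify.
Tools available: search(query), calculator(expr).
Task: {user_task}"""

# Post-training: the skill lives in the weights, learned from traces like this toy
# supervised fine-tuning record (a real trace would log full tool calls and results).
sft_record = {
    "input": "What is 17.5% of 46.80?",
    "target": "CALL calculator('46.80 * 0.175') -> 8.19; ANSWER: 8.19",
}

print(in_context_prompt.format(user_task="Find three cited sources on ocean warming."))
print(sft_record["target"])
```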

🍞 Hook: Teams beat solo players when the task is big.

🥬 The Concept: Multi-agent collaboration splits work across manager–worker–critic roles with communication and shared memory, so the system can debate, verify, and specialize. How it works: 1) Assign roles. 2) Communicate reasoning in natural language. 3) Verify and revise. 4) Share memory across agents. 5) Coordinate to a final answer. Why it matters: Diversity of thinking reduces blind spots and boosts reliability. 🍞 Anchor: One agent plans a coding fix, another writes the patch, a third runs tests and reviews errors before merging.

The gap this survey fills: It organizes this fast-growing area into a clear roadmap—foundations (planning, tools, search), self-evolving improvements (feedback, memory), and teamwork (multi-agent)—across two optimization modes (in-context vs post-training). The stakes are real: better homework help, safer software updates, more reliable scientific exploration, and assistive healthcare tools that check and cite evidence. This is how we move from smart talkers to careful doers that keep learning.

02Core Idea

🍞 Hook: Imagine a thoughtful chef who not only writes a recipe but also cooks, tastes, fixes mistakes, and teaches the kitchen team for next time.

🥬 The Concept: The aha! moment is to make reasoning the boss: first plan, then act with tools, then learn from feedback and memory—sometimes with a whole team of agents. How it works:

  1. Foundational skills: planning, tool use, and search let an agent decompose goals and take verified actions.
  2. Self-evolving: reflection, validators, and memory store lessons and shape better future behavior.
  3. Collective: multiple agents coordinate roles, debate, and check each other’s work.
  4. Two modes: use in-context workflows at test time or train the model (SFT/RL) so skills become built-in. Why it matters: This unifies thinking, doing, and learning, turning brittle one-shot answers into adaptive, long-horizon problem solving. 🍞 Anchor: A research agent planning a literature review decides queries, searches academic sites, cites sources, stores summaries, and refines its approach over weeks.

Three analogies (same idea, different lenses):

  • Coach-and-player: The plan (coach) guides actions (player). After the game, video review (feedback) updates future plays (memory/training).
  • GPS-and-car: The map (reasoning) plans a route; the car moves (action); traffic feedback changes the route; the system remembers common detours.
  • Scientist-and-lab: The scientist hypothesizes (plan), runs experiments (act), checks results (validate), writes notes (memory), and coordinates with colleagues (multi-agent).

Before vs after:

  • Before: LLMs answered once, with no tools, no memory, no retries. Great for static quizzes; shaky for changing tasks.
  • After: Agents loop: plan → act → check → remember → improve. They coordinate across agents when needed and can internalize skills via training.

🍞 Hook: You know how walking a maze is easier if you can try paths, mark dead ends, and remember what worked?

🥬 The Concept: Why it works is simple: actions and feedback reduce guessing. Tools handle precise or up-to-date parts. Memory prevents repeating mistakes. Teams add diverse viewpoints. Training turns good habits into reflexes. How it works:

  1. Internal thoughts pick next best action (not just next word).
  2. Tools fetch facts, compute, or execute code to verify.
  3. Validators (tests, rules) catch errors quickly.
  4. Memory reuses helpful patterns and facts later.
  5. Collaboration cross-checks and specializes. Why it matters: Each loop trims uncertainty and compounds learning, especially in long, messy tasks. 🍞 Anchor: Fixing a bug with tests gives fast, truthful feedback—retry until green tests pass, then store the fix pattern for future tickets.

Building blocks (sandwich style):

  • Planning 🍞 Hook: You don’t eat a whole pizza at once—you slice it. 🥬 What: Planning breaks a big goal into ordered steps. How: 1) Understand the goal. 2) Split into subtasks. 3) Order them. 4) Choose tools per step. 5) Adjust if things change. Why: Without it, the agent gets lost on long tasks. 🍞 Anchor: Plan a trip: pick dates, search flights, book, then check visa rules.
  • Tool use 🍞 Hook: Even superheroes need gadgets. 🥬 What: Tool use means calling calculators, web search, code runners, or APIs. How: 1) Decide if a tool is needed. 2) Pick the right one. 3) Form a valid call. 4) Read results. 5) Update the plan. Why: Without tools, the agent may guess or use stale info. 🍞 Anchor: Use a calculator for 17.5% tip on a $46.80 bill.
  • Search 🍞 Hook: Finding the right book in a giant library needs a strategy. 🥬 What: Search fetches the most useful information at the right time. How: 1) Decide if more info is needed. 2) Write a query. 3) Retrieve. 4) Check relevance. 5) Repeat until confident. Why: Without search, answers drift or lack evidence. 🍞 Anchor: Look up the newest climate report and quote exact numbers.
  • Feedback and memory 🍞 Hook: A diary helps you avoid old mistakes. 🥬 What: Feedback checks results; memory stores what mattered. How: 1) Validate (tests, rules, expert notes). 2) Reflect on errors. 3) Save useful steps or facts. 4) Reuse next time. Why: Without them, the agent keeps stumbling in the same places. 🍞 Anchor: Keep a “bug fix recipe” you reuse for similar errors.
  • Multi-agent teamwork 🍞 Hook: A band sounds best when each instrument plays its part. 🥬 What: Multiple agents take roles—manager, worker, critic—and talk to align. How: 1) Assign roles. 2) Share reasoning. 3) Disagree respectfully. 4) Verify. 5) Merge the best result. Why: Complex tasks benefit from specialization and debate. 🍞 Anchor: One agent drafts a policy, another checks fairness, a third ensures legal compliance.

03Methodology

At a high level: Input → Plan (think) → Act (tools/search) → Check (validate/reflect) → Remember (memory) → Output (or loop again).
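
To make the loop concrete, here is a minimal Python sketch of one plan → act → check → remember cycle. Everything in it (the `plan`, `act`, and `validate` helpers and the toy memory list) is an illustrative placeholder, not an interface from the survey.

```python
# Minimal agent loop sketch: Plan -> Act -> Check -> Remember -> loop or stop.
# Every component below is a toy stand-in used only to show the data flow.

def plan(goal, memory):
    # A real planner would decompose the goal (and consult memory); we fake two steps.
    return [f"search for facts about {goal}", "summarize findings with citations"]

def act(step):
    # A real agent would call a tool here (search API, code runner, ...).
    return f"result of: {step}"

def validate(result):
    # A real validator might run unit tests or check that every claim has a citation.
    return result.startswith("result of:")

def run_agent(goal, max_loops=3):
    memory = []                                   # lessons carried across loops
    for _ in range(max_loops):
        steps = plan(goal, memory)                # think: break the goal into steps
        results = [act(s) for s in steps]         # do: execute each step
        if all(validate(r) for r in results):     # check: verify the outcomes
            memory.append(("worked", goal))       # remember what succeeded
            return results
        memory.append(("failed", goal))           # remember the failure and try again
    return None

print(run_agent("microplastic effects on fish"))
```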

We introduce key concepts in the order a sixth grader can follow, using the sandwich pattern where new ideas appear.

  1. Input and goal setting 🍞 Hook: Starting a school project, you first ask: What’s the assignment exactly? 🥬 What: Goal setting turns a user request into a clear target and success rules. How: 1) Read the request. 2) Clarify missing pieces (ask questions). 3) Write down constraints (deadline, sources). 4) Define done-ness (what counts as success). Why: Without clear goals, plans wander. 🍞 Anchor: “Summarize three reliable articles on ocean warming with citations by Friday.”

  2. Planning (in-context vs post-training) 🍞 Hook: You can either read a step-by-step guide now or train to do it from memory later. 🥬 What: Planning decides the sequence of steps; it can be orchestrated at test time (in-context) or baked into the model by training (post-training). How:

  • In-context planning: 1) Use templates/workflows (perceive → reason → act → verify). 2) Try search trees (try multiple branches). 3) Switch strategies if stuck.
  • Post-training planning: 1) Fine-tune on good plans (SFT). 2) Use RL so successful plans become more likely. 3) Internalize habits for later tasks. Why: In-context is fast and flexible; post-training makes planning robust and quicker at run time. 🍞 Anchor: A code-fixing agent tries two patch plans in parallel, tests both, picks the one that passes more tests, and later learns that style through RL.
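
As a tiny illustration of the in-context side, here is a sketch of "try several candidate plans and keep the one that passes more checks"; the candidate plans and the checker are made up for the example.

```python
# Sketch: sample candidate plans, score each with a checker, keep the best-checked one.
# The candidate plans and the checker are invented placeholders.

def checker(plan):
    # Stand-in for running the test suite; counts how many steps include a verification.
    return sum(1 for step in plan if "test" in step or "verify" in step)

candidate_plans = [
    ["edit the function", "commit"],
    ["edit the function", "run tests", "verify output", "commit"],
]

best_plan = max(candidate_plans, key=checker)   # the plan passing more checks wins
print("chosen plan:", best_plan)
```
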
  3. Tool use 🍞 Hook: If a screw is stuck, you grab the right screwdriver, not your fingers. 🥬 What: Tools include web search, calculators, code executors, databases, and APIs. How: 1) Detect a need (e.g., exact math, fresh info). 2) Select a tool. 3) Call it with correct arguments. 4) Parse the output. 5) Update the plan. Why: Tools reduce guessing and anchor answers to evidence. 🍞 Anchor: For “What’s the current population of Tokyo (2026 estimate)?” the agent searches trusted sites, cites sources, and stores the link.
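
A minimal sketch of this decide–select–call–parse loop, assuming a toy tool registry (the routing rule and both tools are invented placeholders; a real agent would call actual search and math APIs):

```python
# Sketch of the tool-use loop: detect the need, pick a tool, call it, read the result.
# The registry and the routing rule are simplified stand-ins for real APIs.
import re

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy only; never eval untrusted input
    "search": lambda query: f"[top result for '{query}']",             # placeholder for a real search API
}

def route(question):
    # 1) Detect whether a tool is needed and 2) select one.
    if re.fullmatch(r"[\d\s\.\+\-\*/%\(\)]+", question):
        return "calculator", question
    return "search", question

def answer(question):
    tool, args = route(question)
    result = TOOLS[tool](args)        # 3) call with arguments, 4) parse the output
    return f"{tool} says: {result}"   # 5) the agent folds this back into its plan

print(answer("46.80 * 0.175"))        # exact math goes to the calculator
print(answer("current population of Tokyo"))
```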

  4. Search (agentic RAG) 🍞 Hook: Don’t carry every book—learn to find the right page when you need it. 🥬 What: Agentic search means the agent decides when to retrieve, what to retrieve, and how to use it mid-reasoning. How: 1) Ask: “Do I need more info?” 2) Issue a targeted query. 3) Pull results. 4) Check relevance and consistency. 5) Iterate or stop. Why: Static retrieval can miss key facts; dynamic search adapts to the question. 🍞 Anchor: A history agent answering about a 2025 treaty updates its search if early results conflict.
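
A small sketch of agentic retrieval over a pretend three-document corpus; the corpus, the word-overlap scoring, and the "enough evidence" rule are all stand-ins for a real retriever and confidence check:

```python
# Sketch of agentic retrieval: query, check the evidence, refine, and stop when satisfied.
# The corpus, scoring, and stopping rule are toy stand-ins.

CORPUS = {
    "2025 treaty overview": "The treaty was signed by 40 countries.",
    "2025 treaty details": "It entered into force in March 2025.",
    "unrelated article": "Microplastics are found in many rivers.",
}

def retrieve(query):
    # Stand-in for a retriever: return passages whose title shares a word with the query.
    words = set(query.lower().split())
    return [text for title, text in CORPUS.items() if words & set(title.lower().split())]

def answer_with_search(question, max_rounds=3):
    evidence, query = [], question
    for _ in range(max_rounds):
        hits = retrieve(query)                    # issue a targeted query
        evidence.extend(h for h in hits if h not in evidence)
        if len(evidence) >= 2:                    # toy confidence check: enough evidence?
            break
        query = question + " details"             # reflect and refine, then try again
    return evidence

print(answer_with_search("2025 treaty"))
```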

  5. Checking with validators and reflection 🍞 Hook: In baking, you poke the cake to see if it’s done. 🥬 What: Validators (tests, rules, experts) and reflection (self-critique) catch mistakes and guide fixes. How: 1) Run tests (unit tests for code, citation checks for research). 2) If fail, analyze errors. 3) Revise steps. 4) Re-run until pass or timeout. Why: Without checks, errors hide under confident language. 🍞 Anchor: A SQL query is run against a sandbox; if it errors, the agent inspects the message and adjusts the schema or syntax.
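
Here is a compact sketch of validator-driven retries, using a toy unit test as the validator; the list of "candidate fixes" stands in for an LLM proposing a new patch after reading each error message:

```python
# Sketch of validator-driven retries: run the check, read the failure, revise, re-run.
# The candidate fixes stand in for an LLM proposing a new patch after each error.

def unit_test(add_fn):
    # The validator: a tiny test suite the candidate must pass.
    assert add_fn(2, 3) == 5, "2 + 3 should be 5"
    assert add_fn(-1, 1) == 0, "-1 + 1 should be 0"

candidate_fixes = [
    lambda a, b: a - b,   # first attempt: wrong operator
    lambda a, b: a * b,   # second attempt: still wrong
    lambda a, b: a + b,   # third attempt: correct
]

def repair_loop(candidates, max_tries=5):
    for attempt, fix in enumerate(candidates[:max_tries], start=1):
        try:
            unit_test(fix)                    # run the validator
            return attempt                    # tests pass: keep this fix and stop
        except AssertionError as err:
            print(f"attempt {attempt} failed: {err}")  # the error guides the next revision
    return None

print("passed on attempt", repair_loop(candidate_fixes))
```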

  6. Memory: write, organize, retrieve 🍞 Hook: A study notebook lets you study smarter next time. 🥬 What: Memory stores facts, successful plans, failures to avoid, and tool usage patterns. How: 1) Decide what to save (useful, general). 2) Summarize briefly. 3) Link to related items. 4) Retrieve when similar tasks appear. Why: Memory prevents re-learning the same lesson and supports long projects. 🍞 Anchor: After solving multiple regex bugs, the agent saves a cheat sheet of patterns and pitfalls.
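
A toy sketch of the save-then-recall cycle; word-overlap matching on tags stands in for the embedding search a real memory store would use:

```python
# Sketch of a tiny agent memory: save short lessons, recall the most relevant one later.
# Word-overlap matching on tags stands in for the embedding search a real store would use.

memory = []   # each entry: (summary, set of tags)

def remember(summary, tags):
    memory.append((summary, set(tags)))              # decide what to save; keep it short

def recall(task_tags, top_k=1):
    task = set(task_tags)
    ranked = sorted(memory, key=lambda m: len(m[1] & task), reverse=True)
    return [summary for summary, _ in ranked[:top_k]]

remember("Anchor every claim to a citation before summarizing.", ["literature", "citations"])
remember("Use raw strings for regex patterns with backslashes.", ["regex", "bugfix"])

print(recall(["regex", "escaping", "bugfix"]))       # pulls the regex lesson, not the citation one
```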

  7. Two optimization modes

  • In-context orchestration 🍞 Hook: It’s like using a recipe card during cooking. 🥬 What: Keep the model frozen and design prompts, roles, and workflows to guide it. How: 1) Provide examples and tool docs. 2) Use step-by-step prompts. 3) Add search trees and verifiers. 4) Cache and reuse helpful context. Why: Rapid to deploy; no training required. 🍞 Anchor: A web agent with a fixed LLM uses a browse–read–decide loop and DOM checks to avoid clicks on the wrong button.
  • Post-training optimization 🍞 Hook: It’s like practicing so much you can cook without looking at the card. 🥬 What: Train the model to internalize skills via SFT or RL. How: 1) Collect high-quality plan/tool/search traces. 2) Fine-tune to imitate. 3) Use RL with rewards (correct, verified, safe) to refine. 4) Distill into smaller models. Why: More stable habits, fewer prompt hacks, better generalization. 🍞 Anchor: A math agent trained with RL learns to call a calculator at the right step and to verify intermediate results.
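
To show what "rewards (correct, verified, safe)" might look like in code, here is a sketch of a composite reward over one agent episode; the component checks and the weights are illustrative assumptions, not values from the survey:

```python
# Sketch of a composite reward for RL-style post-training of agent behavior.
# The component checks and the weights are illustrative assumptions, not survey values.

def reward(episode):
    # episode: a dict describing one trajectory (final answer, tool calls, validator results).
    correct   = 1.0 if episode["answer_matches_reference"] else 0.0
    verified  = 1.0 if episode["all_validators_passed"] else 0.0
    safe      = 0.0 if episode["used_disallowed_tool"] else 1.0
    efficient = max(0.0, 1.0 - 0.1 * episode["num_tool_calls"])   # mild cost per tool call
    return 0.5 * correct + 0.3 * verified + 0.15 * safe + 0.05 * efficient

example_episode = {
    "answer_matches_reference": True,
    "all_validators_passed": True,
    "used_disallowed_tool": False,
    "num_tool_calls": 4,
}
print(reward(example_episode))   # higher-reward trajectories are reinforced during training
```
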
  8. Multi-agent collaboration 🍞 Hook: A relay race is faster when runners pass the baton smoothly. 🥬 What: Split roles (manager, worker, critic), add communication rules, and share memory. How: 1) The manager decomposes tasks. 2) Workers propose solutions. 3) Critics verify and suggest fixes. 4) Merge and finalize. 5) Store team takeaways. Why: Complex tasks need specialization and multiple perspectives. 🍞 Anchor: For a grant proposal, one agent drafts science aims, one edits clarity, one checks references and compliance.
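
A minimal sketch of one manager–worker–critic round, with three plain functions standing in for separate agent (LLM) calls; the decomposition and the critic's rule are toy examples:

```python
# Sketch of one manager-worker-critic round: decompose, propose, verify, merge.
# The three "agents" are plain functions standing in for separate LLM calls.

def manager(task):
    return [f"draft the {part} section of the {task}" for part in ("aims", "compliance")]

def worker(subtask):
    return f"[text for '{subtask}']"                      # propose a solution

def critic(draft):
    ok = draft.startswith("[") and draft.endswith("]")    # toy verification rule
    return ok, "" if ok else "formatting problem"

def team_solve(task):
    finished = []
    for subtask in manager(task):                         # manager decomposes the task
        draft = worker(subtask)                           # a worker proposes
        ok, note = critic(draft)                          # a critic verifies
        if not ok:
            draft = worker(f"{subtask} (fix: {note})")    # one revision round on criticism
        finished.append(draft)
    return "\n".join(finished)                            # merge into the final result

print(team_solve("grant proposal"))
```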

Example end-to-end mini-recipe (actual data flow):

  • User: “Summarize 3 trustworthy sources on microplastic effects on fish and give 5 bullet takeaways with citations.”
  • Plan: Identify substeps (find sources → read → extract → summarize → cite).
  • Search/Tools: Use web search API; filter for government or peer-reviewed articles; fetch pages.
  • Check: Ensure each claim maps to a citation; run a plagiarism and fact consistency check.
  • Memory: Save the best sources, a query template that worked well, and a checklist for future literature tasks.
  • Output: Five bullets with inline citations and a reference list.

Secret sauce: Separate thinking from acting, add tools for evidence, use validators to catch errors, remember what matters, and (optionally) train the model so good behavior becomes natural. This loop turns static talk into adaptive, verifiable action.

04Experiments & Results

🍞 Hook: When you test a basketball team, you don’t just check shooting—you check passing, defense, and teamwork too.

🥬 What was tested: This survey reviews benchmarks that isolate each agent skill and end-to-end application tests. How it works:

  1. Core-skill benchmarks: tool use (can the agent pick and call the right API?), search (can it retrieve and use evidence?), planning/memory (can it follow long workflows and remember?), and multi-agent (can roles coordinate?).
  2. Application benchmarks: web agents (navigate sites reliably), robotics/embodied tasks (follow multi-step instructions), science agents (design or check experiments), healthcare/clinical QA (cite guidelines), math/code agents (solve, run, and verify). Why it matters: Clear tests reveal which parts are strong and which need work. 🍞 Anchor: A code benchmark might require editing a real repo and passing the project’s unit tests—no pass, no points.

The competition: Baselines include static LLMs (no tools, no memory), simple CoT solvers, and early single-agent systems. Newer agentic systems add dynamic search, validators, memory, or teams of agents. Comparisons show where each addition helps most.

Scoreboard with context (qualitative):

  • Tool use: Agents that plan-before-act and verify calls typically outperform zero-tool baselines, especially on math, code execution, and data lookup—like moving from a C to an A- when precise computation is needed.
  • Search: Dynamic, step-wise search (ask, retrieve, reflect) reliably beats single-shot retrieval, especially on multi-hop questions—similar to going from a B- to a solid A on evidence-heavy tasks.
  • Planning/memory: Structured workflows plus memory cut error cascades in long tasks—more like finishing the puzzle rather than quitting halfway.
  • Multi-agent: Role-based systems often improve accuracy and robustness over single agents, though too much chatter can slow things down—so the win depends on good coordination.
  • Post-training (SFT/RL): Internalizing skills tends to reduce flaky behavior and improve generalization beyond the prompt design—like practicing scales to play songs more cleanly.

Surprising findings and lessons:

  • Small-but-well-orchestrated systems can rival larger models that lack tooling and validation; brains plus tools can beat brute force.
  • Validator-driven retries (like unit tests) can dramatically raise success in code tasks even without explicit reasoning edits.
  • Memory quality matters more than memory size; messy, unrelated memories can hurt.
  • Multi-agent debates help until they don’t—over-communication can create loops or drift.
  • RL for tool/search behaviors often yields more robust decisions than imitation alone, especially in open-ended web settings.

Why these results make sense: Verifiable actions and feedback turn “best-guess” into “best-checked.” Structured plans prevent getting lost. Memory shrinks repeat mistakes. Collaboration adds second opinions. Training cements good habits. Put together, agentic systems are better suited for dynamic, long-horizon tasks than static LLMs.

🍞 Anchor: On a web benchmark, a ReAct-style agent that reads, clicks, checks, and retries typically completes more tasks reliably than a one-shot model that just guesses a final answer.

05Discussion & Limitations

Limitations (be specific):

  • Stability and cost: Multi-step planning, search, and validation raise latency and compute bills; long dialogues can overflow context windows.
  • Tool brittleness: Changing APIs or website layouts can break agents unless tool docs and wrappers stay current.
  • Memory hygiene: Saving too much or low-quality memories can cause confusion; deciding what to keep/delete is nontrivial.
  • Safety and trust: Tools can be misused; persuasive but wrong outputs can slip through if validators are weak; multi-agent chatter can amplify errors.
  • Evaluation gaps: Many benchmarks measure final answers more than process quality (faithfulness of reasoning, calibration, and action safety).

Required resources:

  • Reliable tool interfaces (APIs, sandboxes, calculators, code runners) with good docs and test harnesses.
  • Memory storage (vector DBs, graph stores) and retrieval pipelines with access control and provenance.
  • Validators (unit tests, rule checkers, simulators, fact checkers) tailored to the domain.
  • Compute budgets for multi-step runs and, if used, post-training via SFT/RL.

When NOT to use agentic systems:

  • Simple, static Q&A where a one-shot model is faster and accurate enough.
  • High-risk domains with weak or unavailable validators (e.g., no ground truth to check against) and strict safety requirements.
  • Ultra-low-latency settings (e.g., tight real-time control) when multi-step reasoning would miss timing.
  • Highly volatile UIs or sites without stable selectors—agents may click the wrong thing.

Open questions:

  • Personalization: How to align plans, tools, and memories with each user’s style and constraints safely?
  • Long-horizon credit assignment: Which earlier step deserves praise/blame and how do we adjust plans or memory accordingly?
  • World models: How can agents form compact, updateable models of environments to plan better with fewer real actions?
  • Scalable multi-agent training: How to train teams (not just individuals) to communicate efficiently and robustly?
  • Governance: How to audit actions, track provenance, enforce permissions, and prevent tool misuse in real deployments?

06Conclusion & Future Work

Three-sentence summary: This survey reframes LLMs as agents that plan, act with tools, and learn via feedback and memory, sometimes working together as teams. It organizes the field into three layers—foundational, self-evolving, and collective—and two optimization modes—in-context orchestration and post-training learning. The result is a practical roadmap for building systems that bridge thoughts and actions in dynamic, real-world settings.

Main achievement: A unified, reasoning-centered blueprint that explains how planning, tool use, search, feedback, memory, and multi-agent collaboration fit together, and how to power them with either smart prompting or training.

Future directions: Personal, safe agents that remember responsibly; stronger long-horizon skills with better credit assignment; world models that let agents plan ahead; scalable training for teams of agents; and solid governance for real deployment.

Why remember this: It marks the moment AI steps from being a great talker to a careful doer—one that plans, checks, and learns—so it can help with the complex, changing tasks we actually care about.

Practical Applications

  • Homework research assistants that search, cite sources, and store reusable study guides.
  • Software engineering agents that propose patches, run unit tests, and document fixes.
  • Scientific literature reviewers that extract findings, compare studies, and plan follow-up questions.
  • Healthcare QA assistants that pull guideline-backed answers and flag uncertainty for clinicians.
  • Autonomous web agents that navigate pages, fill forms, and verify actions before submitting.
  • Data analysis copilots that fetch datasets, run computations, plot results, and explain methods.
  • Customer support triage bots that look up policies, call internal tools, and provide verified answers.
  • Robotics task planners that break goals into sub-steps and validate success with sensors.
  • Compliance and policy checkers that cross-reference requirements, annotate risks, and propose fixes.
  • Team-of-agents workflows for complex writing (drafting, reviewing, fact-checking) with shared memory.
#Agentic Reasoning#LLM Agents#In-Context Learning#Reinforcement Learning#Supervised Fine-Tuning#Tool Use#Retrieval-Augmented Generation#Memory-Augmented Agents#Reflection and Validation#Multi-Agent Systems#Planning#Search#World Models#Workflow Orchestration#Benchmarking Agents
Version: 1