Reinventing Clinical Dialogue: Agentic Paradigms for LLM Enabled Healthcare Communication

Intermediate
Xiaoquan Zhi, Hongke Zhao, Likang Wu et al. · 12/1/2025
arXiv · PDF

Key Summary

  • Clinical conversations are special because they mix caring feelings with precise medical facts, and old AI systems struggled to do both at once.
  • Big language models (LLMs) talk well, but they can guess instead of verify, forget past visits, and can’t use hospital tools by themselves.
  • This paper explains a new agentic way to build medical AIs that plan steps, remember patients over time, use tools, work in teams, and learn from feedback.
  • It introduces a simple map with two axes: where knowledge comes from (inside the model vs. trusted sources) and what the agent does (understand vs. act).
  • These axes create four styles of agents: Latent Space Clinicians, Grounded Synthesizers, Emergent Planners, and Verifiable Workflow Automators.
  • Each style balances creativity vs. reliability and autonomy vs. safety in different ways to fit real clinical needs.
  • The survey connects big ideas to real systems, tools, datasets, and how we should test clinical agents safely.
  • It highlights open challenges like trust, updates to medical knowledge, clear reasoning, and ethical guardrails.
  • The big message: make medical AI that doesn’t just talk but can plan, verify with evidence, remember, collaborate, and keep improving while staying safe.

Why This Research Matters

This work shows how to build medical AIs that don’t just talk—they help safely. It helps hospitals choose the right kind of agent for each job, from evidence-cited summaries to strict triage workflows. It reduces errors by making agents check tools and sources instead of guessing. It builds trust with clear logs, citations, and handoffs to humans when needed. It supports fairer access to care by scaling guidance while preserving safety. And it points the way to learning systems that stay up-to-date as medicine changes.

Detailed Explanation


01Background & Problem Definition

🍞 You know how a great coach doesn’t just cheer but also makes a game plan, remembers past games, checks the rulebook, and helps the team play safely? Clinical conversations need that kind of smart help too. They must be kind and careful at the same time.

🥬 The Concept: Clinical dialogue is a special kind of talk between patients and clinicians where the goal is better health. It mixes empathy (being kind) with exact medical facts.

  • How it works:
    1. A patient shares worries or symptoms.
    2. The clinician asks targeted questions to understand what’s really going on.
    3. They check facts (like tests or records), explain options, and decide next steps.
    4. They keep track over time, because health changes.
  • Why it matters: If the system can’t understand, check facts, or remember, it may give advice that sounds nice but isn’t safe.

🍞 Example: A patient says, “I’m dizzy.” A good dialogue asks about timing, triggers, medications, and checks records before suggesting care.

🍞 You know how a smart parrot can repeat sentences but doesn’t actually understand or plan what to do next? That’s like earlier AI in clinics.

🥬 The Concept: Large Language Models (LLMs) are great at talking but, by default, they are reactive (they answer one message at a time), stateless (they forget long-term), and can hallucinate (make up confident but wrong facts).

  • How it works:
    1. They predict the next word from patterns in huge text datasets.
    2. They don’t automatically check real databases or tools.
    3. They don’t keep a long-running memory of the patient unless we build it in.
  • Why it matters: Without checking and memory, fluent answers can be wrong or inconsistent across visits.

🍞 Example: Asked about a drug dose, a raw LLM might sound confident but pick the wrong number if it can’t call a calculator or guideline.

🍞 Imagine trying to build a Lego castle with no instructions, no box picture, and forgetting where you put pieces yesterday. Frustrating, right?

🥬 The Concept: Hallucination and statelessness are two big problems in medical AI.

  • How it works:
    1. Hallucination: the model guesses something that sounds right but isn’t verified.
    2. Statelessness: the model treats each chat as new and forgets long-term patient history.
  • Why it matters: In medicine, a wrong fact or lost history can be unsafe.

🍞 Example: The model “remembers” a penicillin allergy one day, forgets it the next—dangerous!

🍞 Picture a helpful librarian who not only chats but also plans searches, keeps a reading log, checks trusted encyclopedias, invites other librarians when needed, and learns from feedback. That’s what we want.

🥬 The Concept: The Agentic Paradigm turns LLMs into active helpers that plan, remember, use tools, team up, and evolve.

  • How it works:
    1. Strategic planning: break a big goal into steps.
    2. Memory management: keep patient details over time.
    3. Action execution: use tools/databases to verify facts.
    4. Collaboration: different agents or roles work together.
    5. Evolution: learn from feedback and improve.
  • Why it matters: This shifts from “pretty words” to “safe, helpful workflows.”

🍞 Example: For chest pain, the agent plans questions, pulls EHR history, runs a risk score tool, cites guidelines, consults a specialist agent, and documents everything.
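
For readers who like to see structure as code, here is a minimal, purely illustrative sketch of the five building blocks as one tiny Python class. Every method body is a placeholder (the class, method names, and return values are assumptions, not the paper's implementation); a real agent would back each method with an LLM, a secure memory store, and audited tools.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ClinicalAgentSketch:
    """Illustrative only: the five agentic building blocks as one tiny class."""
    memory: dict = field(default_factory=dict)      # non-parametric, patient-specific notes
    feedback_log: list = field(default_factory=list)

    def plan(self, goal: str) -> list[str]:
        # Strategic planning: break a goal into ordered, checkable steps.
        return [f"clarify symptoms for: {goal}", "review history",
                "verify with tools", "draft cited advice"]

    def remember(self, key: str, value: Any) -> None:
        # Memory management: persist facts (allergies, labs) across visits.
        self.memory[key] = value

    def act(self, step: str) -> str:
        # Action execution: a real system would call an EHR API, a guideline
        # search, or a risk calculator here instead of guessing.
        return f"[tool result for: {step}]"

    def collaborate(self, drafts: list[str]) -> str:
        # Collaboration: merge specialist agents' drafts (here, trivially).
        return " | ".join(drafts)

    def evolve(self, outcome: str) -> None:
        # Evolution: log feedback so future plans can improve.
        self.feedback_log.append(outcome)

agent = ClinicalAgentSketch()
agent.remember("allergy", "penicillin")
print(agent.plan("chest pain"))
```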

🍞 Think of a map with two axes: where your information comes from, and what you try to do with it.

🥬 The Concept: This survey’s taxonomy has two axes: Knowledge Source (inside the model vs. trusted external sources) and Agency Objective (understand the situation vs. execute a workflow).

  • How it works:
    1. Knowledge source: Implicit (inside LLM) vs. Explicit (grounded in EHRs/guidelines/tools).
    2. Objective: Event cognition (understand/summarize) vs. Goal execution (act/complete tasks).
    3. Combine them to get 4 agent types.
  • Why it matters: It explains trade-offs: creativity vs. reliability, autonomy vs. safety.

🍞 Example: A creative inside-knowledge agent can brainstorm diagnoses, while an evidence-grounded workflow agent follows a verified triage path step by step.
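
The 2x2 map can also be written down as a small lookup table. This is only a sketch of the taxonomy described above (the enum and dictionary names are mine, not the survey's), but it shows how the two axes determine the four agent styles.

```python
from enum import Enum

class KnowledgeSource(Enum):
    IMPLICIT = "inside the model"          # parametric knowledge
    EXPLICIT = "trusted external sources"  # EHRs, guidelines, tools

class AgencyObjective(Enum):
    EVENT_COGNITION = "understand the situation"
    GOAL_EXECUTION = "execute a workflow"

# The four paradigms fall out of crossing the two axes.
PARADIGMS = {
    (KnowledgeSource.IMPLICIT, AgencyObjective.EVENT_COGNITION): "Latent Space Clinician (LSC)",
    (KnowledgeSource.EXPLICIT, AgencyObjective.EVENT_COGNITION): "Grounded Synthesizer (GS)",
    (KnowledgeSource.IMPLICIT, AgencyObjective.GOAL_EXECUTION):  "Emergent Planner (EP)",
    (KnowledgeSource.EXPLICIT, AgencyObjective.GOAL_EXECUTION):  "Verifiable Workflow Automator (VWA)",
}

print(PARADIGMS[(KnowledgeSource.EXPLICIT, AgencyObjective.GOAL_EXECUTION)])
# -> Verifiable Workflow Automator (VWA)
```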

🍞 Imagine four superheroes, each with a different specialty.

🥬 The Concept: The four paradigms are:

  • Latent Space Clinician (LSC): uses internal knowledge to interpret cases.
  • Grounded Synthesizer (GS): retrieves and cites trusted evidence.
  • Emergent Planner (EP): autonomously plans multi-step clinical tasks.
  • Verifiable Workflow Automator (VWA): follows strict, testable workflows.
  • How it works:
    1. LSC: good at creative sense-making within the model.
    2. GS: strict evidence-first summaries with citations.
    3. EP: plans and adapts during tasks.
    4. VWA: executes protocol steps with audit trails.
  • Why it matters: Different clinics need different balances of creativity and safety.

🍞 Example: LSC suggests likely diagnoses; GS cites guidelines; EP runs a conversation to set asthma care goals; VWA executes a triage script safely.

🍞 You know how doctors use EHRs (Electronic Health Records) to see your past visits?

🥬 The Concept: EHR access is key to safe AI.

  • How it works:
    1. The agent securely queries EHR data with permissions.
    2. It summarizes relevant history.
    3. It updates memory as new info arrives.
  • Why it matters: Without real patient data, advice can be off-target.

🍞 Example: The agent avoids suggesting ibuprofen if the EHR shows a kidney issue.
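
Here is a toy version of that ibuprofen example, assuming a hypothetical, permissioned EHR query that is mocked with a plain dictionary. A real system would use a secure EHR API and a maintained drug-knowledge source rather than the hard-coded rule shown here.

```python
# Mock EHR standing in for a real, permissioned query; all names are hypothetical.
MOCK_EHR = {
    "patient-123": {"problems": ["chronic kidney disease"], "allergies": ["penicillin"]},
}

# Simplified, illustrative rule; real rules come from curated drug databases.
CONTRAINDICATIONS = {
    "ibuprofen": {"chronic kidney disease"},
}

def safe_to_suggest(patient_id: str, drug: str) -> bool:
    """Return False if a recorded problem contraindicates the drug."""
    problems = set(MOCK_EHR.get(patient_id, {}).get("problems", []))
    return not (problems & CONTRAINDICATIONS.get(drug, set()))

print(safe_to_suggest("patient-123", "ibuprofen"))  # False -> the agent avoids suggesting it
```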

Overall, the world before was full of fluent but unreliable AI chat. The problem is safety, memory, and real tool use. Past fixes (rigid pipelines, simple retrieval, plain LLMs) each missed at least one of those core needs. The gap: turning talkers into doers who plan, ground, remember, collaborate, and improve. The stakes are real: safe, timely, personalized care for everyone.

02Core Idea

🍞 Imagine moving from a GPS that only repeats street names to a co-pilot that plans the route, checks traffic data, remembers your preferences, talks to other pilots, and learns over time.

🥬 The Concept (Aha!): Make clinical AIs agentic—able to plan, remember, use tools, collaborate, and evolve—organized by a simple 2x2 map: where knowledge comes from (inside vs. external) and what agents aim to do (understand vs. act).

  • How it works:
    1. Define two axes: Knowledge Source (implicit vs. explicit) and Agency Objective (event cognition vs. goal execution).
    2. Get four agent styles (LSC, GS, EP, VWA), each with strengths and trade-offs.
    3. Build each style with five building blocks: planning, memory, action, collaboration, evolution.
  • Why it matters: It turns LLMs from guessers into reliable teammates matched to clinical needs.

🍞 Example: A clinic picks a GS for medication safety checks (evidence-first) and a VWA for triage (strict workflow), while an EP coaches diabetes behavior change.
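
One way to picture that clinic's choice is a small configuration table mapping each task to a paradigm and a safety posture. The field names and values below are purely illustrative assumptions, not a standard from the paper.

```python
# Hypothetical deployment config mirroring the clinic example above.
DEPLOYMENT = {
    "medication_safety_check":  {"paradigm": "GS",  "grounding": "required", "autonomy": "low"},
    "ed_triage":                {"paradigm": "VWA", "grounding": "required", "autonomy": "scripted"},
    "diabetes_coaching":        {"paradigm": "EP",  "grounding": "optional", "autonomy": "adaptive"},
    "rare_disease_sensemaking": {"paradigm": "LSC", "grounding": "optional", "autonomy": "advisory"},
}

for task, cfg in DEPLOYMENT.items():
    print(f"{task}: {cfg['paradigm']} (autonomy={cfg['autonomy']})")
```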

Three analogies:

  • School project teams: The brainstormer (LSC), the fact-checker (GS), the project manager (EP), and the checklist enforcer (VWA).
  • Kitchen: The creative chef (LSC), the recipe librarian (GS), the head chef coordinating stations (EP), and the food safety inspector ensuring rules (VWA).
  • Sports: The play-reader (LSC), the stats analyst (GS), the field captain (EP), and the rule-official (VWA).

🍞 Before vs. After:

  • Before: LLMs answered nicely but could be wrong, forgetful, and tool-blind.
  • After: Agents plan, verify, remember, team up, and learn—tailored to when creativity or safety matters most.

🍞 Why it works (intuition):

  • Separate thinking (planning) from doing (actions) and remembering (memory), then constrain or empower each part based on safety needs.
  • Use external evidence when accuracy is critical; use internal knowledge when creativity or speed helps.
  • Add collaboration to reduce single-model blind spots, and evolution to keep up with changing medicine.

🍞 Building Blocks (each with a quick sandwich):

  1. Strategic Planning
  • 🍞 Imagine writing a to-do list before cleaning your room.
  • 🥬 What: Break big clinical goals into steps; decide order and checkpoints. How: Decompose goals, simulate options, pick safest path. Why: Without it, the agent may wander.
  • 🍞 Example: Chest pain → ask key questions → check risk score → advise care level.
  2. Memory Management (a small memory sketch follows this list)
  • 🍞 Think of a patient diary plus a medical textbook.
  • 🥬 What: Parametric memory (what the model knows) + non-parametric memory (patient-specific notes). How: Store, update, and retrieve key facts across visits. Why: Without memory, advice flips and misses history.
  • 🍞 Example: Remember long-term A1c trends when planning diabetes care.
  3. Action Execution
  • 🍞 Like using a calculator instead of estimating in your head.
  • 🥬 What: Call tools (calculators, search, EHR queries) to get trustworthy results. How: Map intent → precise API/query → parse result. Why: Prevents math errors and outdated advice.
  • 🍞 Example: Use a dose calculator tool; cite the guideline.
  4. Collaboration
  • 🍞 A medical team is better than a lone hero.
  • 🥬 What: Multiple agents (roles) debate, divide tasks, and merge answers. How: Orchestrator or peer network; consensus rules. Why: Reduces bias and surfaces blind spots.
  • 🍞 Example: Cardio-agent and drug-safety-agent agree on next steps.
  5. Evolution
  • 🍞 Practice makes progress.
  • 🥬 What: Learn from feedback to improve policies and prompts. How: Store outcomes, update strategies safely. Why: Keeps systems current and reliable.
  • 🍞 Example: After repeated misses on asthma education, improve that module.
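
As promised in building block 2, here is a toy non-parametric memory (the "patient scrapbook"): it stores facts with dates, recalls the latest value, and flags contradictions instead of silently overwriting them. The class and method names are assumptions for illustration, not the survey's design.

```python
from typing import Optional

class PatientMemory:
    """Toy non-parametric memory: facts are kept per key as (date, value) history."""
    def __init__(self):
        self._facts = {}  # key -> list of (date, value) entries

    def remember(self, key: str, value: str, date: str) -> None:
        history = self._facts.setdefault(key, [])
        if history and history[-1][1] != value:
            # Don't silently overwrite clinical facts; surface the conflict.
            print(f"Contradiction for '{key}': {history[-1][1]} -> {value}")
        history.append((date, value))

    def recall(self, key: str) -> Optional[str]:
        history = self._facts.get(key)
        return history[-1][1] if history else None

mem = PatientMemory()
mem.remember("penicillin_allergy", "yes", "2024-01-10")
mem.remember("penicillin_allergy", "no", "2024-06-02")   # triggers the contradiction warning
print(mem.recall("penicillin_allergy"))                   # latest value: "no"
```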

🍞 Anchor: A hospital configures the 2x2: LSC for rapid sense-making in rare disease clinics, GS for evidence-cited discharge summaries, EP for smoking-cessation coaching plans, and VWA for ED triage workflows. Each agent uses planning, memory, tools, teams, and learning in a safety-tuned way.

03Methodology

High-level recipe: Input (patient message + EHR context) → Strategic Planning (set safe steps) → Memory Management (recall long-term facts) → Action Execution (query tools/databases) → Collaboration (merge specialists’ views) → Evolution (log feedback, improve) → Output (empathetic, verified advice).
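
That recipe can be pictured as one function chain. This is a minimal sketch with stubbed steps (all function names and return values are hypothetical); in a real system each step would call the planner, memory store, tools, and specialist agents described below.

```python
def plan_steps(message: str) -> list[str]:
    # Strategic Planning: turn the goal into ordered, checkable steps.
    return ["screen red flags", "review EHR vitals", "compute risk score", "draft cited advice"]

def recall_memory(patient_id: str) -> dict:
    # Memory Management: hypothetical stored facts for this patient.
    return {"age": 62, "allergies": ["penicillin"]}

def execute_actions(steps: list[str], context: dict) -> dict:
    # Action Execution: stubs for EHR queries, calculators, and guideline search;
    # the patient context would shape each tool call in a real system.
    return {step: f"verified result for '{step}'" for step in steps}

def collaborate(results: dict) -> str:
    # Collaboration: stand-in for merging specialist agents' views.
    return "; ".join(results.values())

def respond(message: str, patient_id: str) -> str:
    steps = plan_steps(message)
    context = recall_memory(patient_id)
    results = execute_actions(steps, context)
    merged = collaborate(results)
    # Evolution would log this exchange and its outcome for later improvement.
    return f"Empathetic, verified reply based on: {merged}"

print(respond("I'm short of breath", "patient-123"))
```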

  1. Strategic Planning
  • What happens: The agent turns a big goal (e.g., triage, education, diagnosis assist) into a step-by-step plan.
  • Why it exists: Without a plan, the system may jump to conclusions.
  • Example data: For “I’m short of breath,” plan steps: check red flags, review EHR vitals, compute risk score, draft advice, cite sources.
  • Sandwich intro: 🍞 You know how you draw a treasure map before hunting? 🥬 Planning is making that map (steps, checkpoints, detours) so the agent doesn’t get lost. 🍞 Example: Chest pain map → questions → risk score → safe recommendation.
  2. Memory Management
  • What happens: The system keeps two memories.
    • Parametric (textbook-in-the-head): general medical knowledge and patterns.
    • Non-parametric (patient scrapbook): notes, summaries, citations, tool outputs.
  • Why it exists: Medicine is longitudinal; forgetting past allergies or labs is unsafe.
  • Example data: Store “penicillin allergy,” “A1c over 12 months,” last discharge plan.
  • Sandwich intro: 🍞 Imagine sticky notes on your desk plus what you’ve already learned at school. 🥬 Memory is both: short-term notes and long-term knowledge. 🍞 Example: Remembering kidney issues before suggesting NSAIDs.
  3. Action Execution (a small tool-call sketch follows this list)
  • What happens: The agent calls tools instead of guessing.
    • Knowledge-based: query knowledge graphs/ontologies.
    • Search engine: gather up-to-date, diverse evidence; then constrain and cite.
    • Calculators/clinical scores: offload to deterministic tools (dose, risk scores).
  • Why it exists: Tools prevent hallucinations and ensure math/logic correctness.
  • Example data: API call to a dosing calculator with age/weight, return exact dose.
  • Sandwich intro: 🍞 When in doubt, use a ruler instead of eyeballing. 🥬 Tools give precise, verified answers. 🍞 Example: Use a stroke-risk score tool instead of estimating.
  4. Collaboration
  • What happens: Systems can use one agent or many specialized agents.
    • Dominant topology: one orchestrator decomposes, delegates, and composes.
    • Distributed topology: peers debate and reach consensus.
  • Why it exists: Teams reduce single-model bias and cover multiple specialties.
  • Example data: Cardio agent checks ECG rules; Pharm agent checks drug interactions; orchestrator merges.
  • Sandwich intro: 🍞 Two heads are better than one—especially in medicine. 🥬 Collaboration means specialty agents share and reconcile. 🍞 Example: A tumor board-like debate within agents.
  5. Evolution
  • What happens: Systems update strategies and prompts from safe feedback.
    • For GS: learn better search queries and tool-calling policies.
    • For EP/LSC: refine reasoning steps and guardrails.
    • For VWA: improve workflow routing, keep rules up to date.
  • Why it exists: Medicine changes; systems must keep pace without forgetting basics.
  • Example data: Logs show confusion on inhaler teaching → add clearer scripts and checks.
  • Sandwich intro: 🍞 Practice drills make the team sharper. 🥬 Evolution turns feedback into better future behavior. 🍞 Example: Faster, more accurate lab retrieval after learning to map MeSH terms.
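
As noted in step 3, here is a small sketch of offloading arithmetic to a deterministic tool: the weight-based rule below is a generic placeholder, not real dosing guidance, and the "source" string stands in for a genuine guideline citation.

```python
def weight_based_dose(weight_kg: float, mg_per_kg: float, max_mg: float) -> float:
    """Deterministic dose calculation with an explicit cap (illustrative formula only)."""
    return min(weight_kg * mg_per_kg, max_mg)

def call_dose_tool(weight_kg: float) -> dict:
    dose = weight_based_dose(weight_kg, mg_per_kg=10.0, max_mg=500.0)
    # The agent stores the result together with a citation so the answer is auditable.
    return {"dose_mg": dose, "source": "hypothetical dosing guideline, v2024"}

print(call_dose_tool(weight_kg=28.0))  # {'dose_mg': 280.0, 'source': ...}
```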

Secret sauce of this paper’s method:

  • A simple 2x2 map (knowledge source × agency objective) that cleanly explains why some agents should be creative (LSC/EP) while others must be evidence-locked (GS/VWA).
  • Clear building blocks (planning, memory, action, collaboration, evolution) that can be mixed to fit the quadrant’s safety/creativity needs.
  • A bridge from abstract ideas to concrete tools, datasets, and evaluation metrics so builders can implement safely.

Concrete walkthrough (sample patient message: “My chest feels tight and I’m sweaty.”):

  • Planning: Triage path with red-flag questions; plan to compute HEART score.
  • Memory: Pull EHR history (age, risk factors) from non-parametric memory; use parametric memory for symptom patterns.
  • Action: Call EHR API for vitals; call risk calculator tool; run guideline search for current troponin protocols; store outputs in memory with citations.
  • Collaboration: Pharm agent checks meds; Cardio agent reviews risk; Orchestrator composes (a small orchestrator sketch follows this list).
  • Evolution: Log which queries were most helpful; next time, ask fewer but more targeted questions.
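
Here is a sketch of the orchestrator topology used in this walkthrough, with the two specialists reduced to plain functions. In a real system each specialist would be its own LLM agent and the orchestrator would apply explicit consensus and escalation rules; everything below, including the risk threshold, is an illustrative assumption.

```python
def cardio_agent(case: dict) -> str:
    # Threshold is illustrative, not clinical guidance.
    return "high risk: recommend ECG and troponin" if case["heart_score"] >= 7 else "moderate risk"

def pharm_agent(case: dict) -> str:
    meds = case["current_meds"]
    return "review interactions for: " + ", ".join(meds) if meds else "no interactions found"

def orchestrator(case: dict) -> str:
    findings = [cardio_agent(case), pharm_agent(case)]   # delegate to specialists
    return " / ".join(findings)                          # compose (stand-in for consensus rules)

case = {"heart_score": 7, "current_meds": ["metoprolol"]}
print(orchestrator(case))
```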

04Experiments & Results

The Test (what to measure and why; a small scorecard sketch follows the list):

  • Factual accuracy and citation quality: Does the agent back claims with trusted sources?
  • Safety: Does it avoid harmful advice and catch red flags?
  • Reasoning/process quality: Are steps logical and auditable?
  • Memory continuity: Does it stay consistent across visits?
  • Efficiency: Does it reduce time-to-answer and clicks for clinicians?
  • Patient experience: Is the tone empathetic and clear?
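
One way to keep these dimensions visible during testing is a simple scorecard object. The dimension names mirror the list above, while the 0 to 1 scale and the hard safety floor are assumptions for illustration, not metrics defined by the survey.

```python
from dataclasses import dataclass

@dataclass
class AgentScorecard:
    factual_accuracy: float   # 0-1, claims backed by trusted sources
    safety: float             # 0-1, red flags caught, no harmful advice
    reasoning_quality: float  # 0-1, steps logical and auditable
    memory_continuity: float  # 0-1, consistent across visits
    efficiency: float         # 0-1, time-to-answer, clinician clicks
    patient_experience: float # 0-1, empathy and clarity

    def passes(self, safety_floor: float = 0.9) -> bool:
        # Safety is treated as a hard floor rather than averaged away.
        return self.safety >= safety_floor

print(AgentScorecard(0.92, 0.95, 0.88, 0.90, 0.80, 0.85).passes())  # True
```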

The Competition (what to compare against):

  • Traditional pipelines (NLU + state tracker + policy): great control, low flexibility.
  • Pure retrieval bots: safe facts, but impersonal and brittle for complex, multi-hop questions.
  • Plain LLM chat: fluent but can hallucinate, forget, and lacks tool use.

The Scoreboard (with context):

  • Grounded Synthesizers often score “A” on factuality and citation (like getting full credit for showing your work) compared to plain LLMs’ “C+”.
  • Verifiable Workflow Automators tend to earn “A” for safety and predictability (like following a lab protocol exactly) compared to emergent planners’ “B” that’s more flexible but riskier without guardrails.
  • Emergent Planners can get “A-” on patient engagement and goal completion in behavior change counseling (they adapt steps), while retrieval bots might get a “B-” for being stiff.
  • Latent Space Clinicians often get “B+/A-” on creative diagnostic sense-making but can drop to “C” on citations if not paired with grounding.

Surprising findings:

  • Small, well-aimed tool calls (calculators, guideline APIs) can boost safety more than large model upgrades.
  • A simple collaboration topology (orchestrator + two specialists) often beats a single giant model for tricky cases.
  • Clear memory logs with citations improve clinician trust even when the final answer is the same.
  • Teaching the agent when NOT to answer (and to escalate) is as valuable as teaching it to answer.

Note: As a survey, this paper synthesizes trends rather than reporting a single new benchmark. It maps real systems to the taxonomy and highlights which metrics best reflect safety and usefulness in clinical settings.

05Discussion & Limitations

Limitations:

  • Real-world validation is scarce: many systems are tested in labs, not across diverse hospitals and populations.
  • Integration complexity: safely connecting to EHRs, tools, and audit logs is non-trivial.
  • Updating knowledge: keeping guidelines and calculators current is a constant effort.
  • Interpretability: even with logs, some internal reasoning remains opaque.

Required resources:

  • Secure data access (EHR APIs), audited tool chains, and strong privacy/consent controls.
  • Domain-tuned prompts/policies, and governance for guardrails and escalation.
  • Logging, monitoring, and evaluation pipelines tied to safety metrics.

When NOT to use:

  • Undifferentiated open-domain chats for high-stakes advice without grounding or workflows.
  • Situations lacking data permissions or with unclear provenance (no way to verify sources).
  • Emergencies where seconds matter and only certified, deterministic protocols are allowed.

Open questions:

  • Best practices for human-in-the-loop oversight without clinician overload.
  • Robust, standardized benchmarks for agentic safety (not just accuracy).
  • Hybrid neuro-symbolic designs that mix LLM intuition with strict medical logic.
  • Fairness and equity: ensuring agents perform well across languages, cultures, and rare conditions.
  • Lifelong learning without forgetting: how to evolve safely as medicine changes.

06Conclusion & Future Work

Three-sentence summary:

  • This paper explains how to turn chatty LLMs into agentic clinical helpers that plan, remember, use tools, collaborate, and evolve.
  • It introduces a simple 2x2 map (knowledge source × agency objective) to define four agent styles—LSC, GS, EP, and VWA—each balancing creativity, reliability, autonomy, and safety.
  • It links these ideas to concrete components, tools, datasets, and evaluation practices to guide safe, real-world healthcare AI.

Main achievement:

  • A first-principles taxonomy that cleanly explains design trade-offs and aligns technical choices (planning, memory, tools, collaboration, evolution) with clinical safety and utility.

Future directions:

  • Better safety benchmarks, clearer audit trails, easier EHR/tool integration, hybrid neuro-symbolic methods, and robust human-in-the-loop governance.

Why remember this:

  • Because the future of medical AI isn’t about chatting better—it’s about reliably helping people by planning, verifying, coordinating, and learning, all while keeping patients safe.

Practical Applications

  • Deploy a Grounded Synthesizer to generate discharge summaries with citations from EHR and guidelines.
  • Use a Verifiable Workflow Automator to run emergency department triage following validated protocols.
  • Adopt an Emergent Planner to guide step-by-step smoking cessation or diabetes education conversations.
  • Enable medication safety checks that call drug-interaction tools and cite official sources.
  • Create a multi-agent second-opinion assistant where specialist agents debate and log consensus.
  • Automate chart review using a distributed pipeline that extracts entities, validates facts, and summarizes.
  • Add risk calculators and dosing tools to agent toolkits to prevent numerical errors.
  • Build a memory layer that preserves key patient facts across visits and flags contradictions.
  • Establish human-in-the-loop escalation policies when evidence conflicts or red flags appear.
  • Continuously refine search and tool-calling policies based on logs to speed up reliable retrieval.
#clinical dialogue#agentic AI#large language models#strategic planning#memory management#tool use#retrieval-augmented generation#EHR integration#multi-agent systems#healthcare safety#interpretability#workflow automation#evaluation metrics#trustworthy AI#medical guidelines