AgentEHR: Advancing Autonomous Clinical Decision-Making via Retrospective Summarization
Key Summary
- AgentEHR is a new, realistic test that asks AI agents to read messy hospital records and make full clinical decisions, not just look up facts.
- RETROSUM is the paper's method that regularly looks back at the whole interaction and re-summarizes it so no crucial clue gets lost.
- Instead of replacing history with a short summary, RETROSUM keeps the full history and adds a smart, updated summary to guide reasoning.
- An experience memory helps the agent reuse good strategies from past patients, making decisions steadier and faster.
- Across six tasks (diagnoses, labs, microbiology, prescriptions, procedures, transfers), RETROSUM improved F1 scores by up to 29.16% over strong baselines.
- RETROSUM cut interaction mistakes dramatically, reducing total errors by up to 92.3% compared with evolving-memory baselines.
- It stayed strong when the database format changed (OOD) and when the context window got tight (down to 8k tokens).
- RETROSUM also needed fewer turns and tokens on average, saving time while improving accuracy.
- The benchmark and toolbox mirror real EHR complexity, helping move AI from toy demos to clinic-like scenarios.
Why This Research Matters
In real hospitals, decisions depend on many small clues spread over long timelines, and missing even one link can change care. This work shows how to keep the full story and still highlight what matters as new evidence arrives, making AI less likely to forget crucial details. It reduces wasted time and clicks by steering agents away from loops and toward valid, standardized answers. The approach also generalizes better when the database looks different or when labels are rare, a common reality in healthcare. By teaching agents "how to think" through experience memories, it supports steadier, faster reasoning. Overall, it moves AI from lookup helpers toward trustworthy clinical teammates.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook): Imagine you're helping a doctor figure out what's wrong with a patient. The patient's whole story is inside a giant digital diary (the EHR) with thousands of entries (lab results, meds, notes, transfers) written over months or years. It's like assembling the right puzzle picture from pieces scattered across dozens of boxes.
Filling (The Actual Concept 1: Multi-step Clinical Decision-Making)
- What it is: Doctors (and AI helpers) must combine many clues over time to decide diagnoses, tests, treatments, and where the patient should go next.
- How it works:
- Gather clues (symptoms, vitals, lab trends, meds)
- Form early ideas (possible diagnoses)
- Order/check tests to confirm or rule out options
- Choose treatments and the right care unit
- Why it matters: If you skip a step or miss a link between steps, you might miss a dangerous condition or order the wrong care.
Bottom Bread (Anchor): Think of sepsis: a fever appears early, but high lactate shows up later. Only by linking both across time do you trigger the sepsis protocol fast enough.
Top Bread (Hook): You know how GPS helps you get around a new city without getting lost? Hospitals are like cities made of tables instead of streets: admissions, labs, procedures, meds, and more.
Filling (The Actual Concept 2: EHR Navigation)
- What it is: Finding the right information inside all those EHR tables.
- How it works:
- Check the map (what tables and columns exist)
- Search by time, keyword, or value
- Pull just the rows that matter now
- Repeat as the situation changes
- Why it matters: Without good navigation, you dig up too much noise (irrelevant rows) or miss crucial evidence.
Bottom Bread (Anchor): If a patient had kidney problems last year, EHR navigation helps you quickly find creatinine trends from lab events to guide safe drug dosing today.
Top Bread (Hook): Ever try to remember a long story but forget a tiny detail that changes the ending? Clinical stories are like that: lots of details matter days or weeks apart.
Filling (The Actual Concept 3: Long-context Information Processing)
- What it is: Using a long, detailed history without losing small but critical links.
- How it works:
- Keep track of many events over time
- Notice cross-time connections (e.g., a new drug then a lab change)
- Summarize without throwing away key specifics
- Update as new clues arrive
- Why it matters: If a summary is too short or one-directional, you cut the thread that ties early clues to later findings.
Bottom Bread (Anchor): If an antibiotic started on Monday lines up with improving white count on Wednesday, that link affects whether you continue, change, or stop therapy.
The world before: Many LLM-in-health papers showed promise on neat, pre-filtered inputs. Agents often acted like fancy search bars, doing query rewriting or fetching obvious facts. But real EHRs are messy, long, and full of noise. When agents tried to handle real decision-making (like choosing diagnoses plus next labs and treatments), they hit limits: context windows filled up, weak summaries dropped important details, and reasoning chains broke.
The problem: Clinical agents must talk to the EHR many times, building long histories full of partial clues. Simple, one-way summarization often shrinks context by throwing out details that only become important later. That fractures reasoning continuity in exactly the places medicine needs it most.
Failed attempts:
- ReAct and classic tool-using agents retrieve facts but don't protect long, cross-time logic.
- ReSum compresses history but can cut hidden links; stronger models sometimes get worse under it.
- Evolving-memory methods help in general tasks, yet in EHR they can be unstable, offering little help when summaries are lossy.
The gap: We need a way to 1) regularly re-check the whole history so late clues can rescue early facts, and 2) learn reusable strategies from past cases, so the agent gets better at picking tools, filtering noise, and keeping crucial numbers and timelines.
Real stakes: In daily life, this means catching sepsis sooner, avoiding harmful drug interactions, ordering the right tests the first time, and moving patients to the right unit without delay. Time back to clinicians, fewer errors for patients, and calmer, more dependable AI teamwork at the bedside.
02 Core Idea
Top Bread (Hook): You know how a great detective flips back through earlier pages when a new clue appears? That's how we should read EHRs too.
Filling (The Actual Concept 4: AgentEHR)
- What it is: A new benchmark that makes AI agents do full clinical decision-making inside real, noisy EHRs.
- How it works:
- Gives six tasks across patient care (diagnoses, labs, microbiology, prescriptions, procedures, transfers)
- Forces multi-step tool use on raw tables, not pre-cleaned notes
- Scores the final decisions using realistic candidate sets
- Why it matters: If tests only measure fact lookup, we'll overestimate agents. AgentEHR measures true clinic-like decision-making.
Bottom Bread (Anchor): Instead of just asking, "What was the last medication?" AgentEHR asks, "What are the likely diagnoses now and which labs and meds fit next?"
Top Bread (Hook): Imagine you keep a full travel diary but also make short travel notes after every few stops so you don't miss key turns.
Filling (The Actual Concept 5: RETROSUM)
- What it is: A method that keeps the full history while adding periodic, look-back summaries that update when new clues arrive.
- How it works:
- Interact with the EHR using tools, creating a growing history
- Every w steps, retrospectively summarize by re-reading distant and recent events together
- Keep both the raw history and the fresh summary for the next action
- Why it matters: One-way compression drops fragile links; RETROSUM preserves them by looking back and not deleting the raw trail.
Bottom Bread (Anchor): If a fever appears now and a high lactate was seen long ago, the retrospective pass reconnects them so the agent triggers sepsis care instead of missing it.
Top Bread (Hook): Think of editing a school essay: after writing a new paragraph, you reread the whole thing to fix earlier parts that no longer fit.
Filling (The Actual Concept 6: Retrospective Summarization)
- What it is: Re-summarizing the history at intervals with new evidence in mind.
- How it works:
- Split the timeline into distant history and a recent window
- Re-evaluate old events using new findings
- Produce a new, high-level summary that highlights cross-time links
- Why it matters: Early "unimportant" facts might become crucial later; this step recovers them.
Bottom Bread (Anchor): A mild creatinine bump seemed trivial... until a new ACE inhibitor was started. Retrospective summarization surfaces the drug-kidney link.
Top Bread (Hook): Imagine a coach's playbook that grows smarter after each game by saving good plays and common traps.
Filling (The Actual Concept 7: Evolving Experience Strategy)
- What it is: A memory that stores how to think (not just facts) from past cases and retrieves the best-fit tips for new ones.
- How it works:
- After a case, reflect on what actions helped or hurt
- Save compact "how-to" lessons in a memory bank with a case embedding
- For a new patient, fetch the most similar memory and apply its advice
- Why it matters: In complex EHRs, good tool choices and noise filtering are half the battle; experience boosts both.
Bottom Bread (Anchor): If past cases taught "prioritize candidate-matching before more record scraping," the agent avoids endless raw data loops and grounds predictions sooner.
The Aha! moment in one sentence: Don't shrink history away; keep it, look back regularly to reconnect clues, and borrow thinking strategies from past, similar patients.
Multiple analogies:
- Detective: Re-open old pages after a new clue; keep case files intact, plus summary notes.
- Cooking: Taste as you go; adjust the earlier seasoning when a late ingredient changes the dish.
- Hiking: Keep the full trail on your GPS; add periodic annotations so you won't miss forks you passed long ago.
Before vs. after:
- Before: One-way, lossy summaries; strong models sometimes got worse; agents acted like search engines.
- After: Retrospective, two-lane context (raw + summary) and learned strategies; stable gains across models and datasets.
Why it works (intuition): Clinical meaning often lives in cross-temporal links (medication → lab change → symptom relief/worsening). Retrospective passes rescue and highlight these links. Keeping raw history prevents logic gaps. Experience retrieval guides tool use and filtering so the agent both thinks better and wastes less effort.
Building blocks:
- A realistic AgentEHR benchmark with six tasks.
- Retrospective summarization every w steps (windowed look-back).
- Retrospective inference that feeds both raw history and the fresh summary to the actor.
- Post-hoc reflection to generate actor and summarizer "how-to" experiences.
- A memory bank keyed by patient-context embeddings to fetch the most relevant tips at inference.
03 Methodology
At a high level: Input (patient ID, time, instruction) → Multi-step EHR tool use to gather clues → Retrospective summarization every w steps → Next action chosen using full history + summary (retrospective inference) → Final predictions (e.g., diagnoses, labs, meds) → After training cases, build experience memory → At test time, retrieve best-fit experiences to guide both summarizer and actor.
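To make this pipeline concrete, here is a minimal control-flow sketch. The names `actor`, `summarizer`, `tools`, and `align` stand in for the LLM calls and the MCP toolbox (illustrative placeholders, not the paper's actual API), and `w` is the summarization window.

```python
def run_episode(actor, summarizer, tools, align, instruction, w=10, max_steps=100):
    """Two-lane loop: keep the full raw trail AND a periodically refreshed summary."""
    history = []   # lane 1: every action/observation pair, never deleted
    summary = ""   # lane 2: the latest retrospective summary

    for step in range(1, max_steps + 1):
        action = actor(instruction, history, summary)     # actor sees both lanes
        if action.get("name") == "finish":
            return align(action.get("arguments", {}))     # map to official candidates
        observation = tools(action)                       # e.g., run a labevents query
        history.append({"action": action, "observation": observation})
        if step % w == 0:                                 # retrospective pass every w steps
            summary = summarizer(history, summary)
    return align({})                                      # turn budget exhausted
```

The design point to notice: `history` is never replaced, and `summary` is rewritten in light of the whole trail rather than extended one way.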
Step-by-step with purpose and examples:
- Define the task and tools
- What happens: The agent gets a patient ID, a reference time, and a clinical instruction (e.g., "List plausible diagnoses now"). It can use a toolbox via an MCP server: record queries (time filters, keyword/value search, SQL), candidate alignment (keyword/fuzzy/semantic), schema inspection, internal thinking, and external knowledge. A minimal toolbox sketch follows this step.
- Why this step exists: Real EHRs are big and messy. Without structured tools, the agent would drown in data.
- Example: For suspected infection, the agent checks labevents for WBC and lactate, prescriptions for antibiotics, and microbiologyevents for cultures.
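To illustrate what two of these tool categories could look like over the curated per-patient SQLite EHRs, here is a hedged sketch; the function names and the simple registry are assumptions, not the benchmark's actual MCP interface.

```python
import sqlite3

# Simplified stand-ins for two tool categories named above
# (schema inspection and record queries) over a per-patient SQLite EHR.

def inspect_schema(db_path: str) -> list[str]:
    """Schema tool: list the tables available in the patient's database."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    return [name for (name,) in rows]

def query_records(db_path: str, sql: str, params: tuple = ()) -> list[tuple]:
    """Record tool: run a read-only query (time filters, keyword/value search, SQL)."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql, params).fetchall()

TOOLS = {"inspect_schema": inspect_schema, "query_records": query_records}
```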
- Multi-step EHR navigation
- What happens: The agent alternates reasoning and acting (ReAct-style). It plans, queries the right table slices, inspects results, and repeats. An example query sketch follows this step.
- Why it exists: Clinical problems require several hops; one query rarely suffices. You gather, test, and refine.
- Example: A new fever prompts checking vitals, then labs for CRP/ESR, then meds and cultures.
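As a concrete example of a single navigation hop, the query below pulls recent lactate and white-blood-cell results before a reference time. The table and column names (labevents, d_labitems, hadm_id, charttime) follow MIMIC-style conventions but should be treated as assumptions here, not the benchmark's guaranteed schema.

```python
import sqlite3

# Illustrative single hop for the fever example above: recent lab rows that
# match a set of keywords, ordered newest first.
def recent_labs(db_path, hadm_id, ref_time, keywords=("lactate", "white blood")):
    like_clauses = " OR ".join("lower(d.label) LIKE ?" for _ in keywords)
    sql = (
        "SELECT l.charttime, d.label, l.valuenum, l.valueuom "
        "FROM labevents AS l JOIN d_labitems AS d ON l.itemid = d.itemid "
        f"WHERE l.hadm_id = ? AND l.charttime <= ? AND ({like_clauses}) "
        "ORDER BY l.charttime DESC LIMIT 50"
    )
    params = (hadm_id, ref_time, *[f"%{k}%" for k in keywords])
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql, params).fetchall()
```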
- Retrospective summarization (every w steps)
- What happens: At steps j divisible by w, the summarizer re-reads the distant history and the recent window together and produces an updated summary that highlights cross-time links. A minimal sketch of this pass follows the step.
- Why it exists: One-way summaries lose details that become important later; a retrospective pass rescues them.
- Example: An early, mild creatinine bump is re-labeled as important once the ACE inhibitor start shows up; the summary now flags a possible drug-kidney issue.
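A minimal sketch of one retrospective pass, assuming `llm` is any text-completion callable; the distant/recent split and the prompt wording are illustrative, not the paper's actual template.

```python
# One retrospective pass: re-read the distant trail in light of recent evidence
# and rewrite the summary instead of appending to it.
def retrospective_summary(llm, history, prev_summary, window=10):
    distant, recent = history[:-window], history[-window:]
    prompt = (
        "You are maintaining a summary of an EHR investigation.\n"
        f"Previous summary:\n{prev_summary}\n\n"
        f"Distant steps (re-read these in light of the new evidence):\n{distant}\n\n"
        f"Most recent steps:\n{recent}\n\n"
        "Rewrite the summary. Keep exact abnormal values and timestamps, and "
        "call out cross-time links (e.g., a drug start followed by a lab change)."
    )
    return llm(prompt)
```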
- Retrospective inference (keep raw history + add summary)
- What happens: The actor never throws away raw steps. It receives the entire action-observation history plus the latest summary and decides the next tool call. A sketch of this input assembly follows the step.
- Why it exists: Replacing history with a short digest can break logical chains. Keeping both preserves precision (raw) and focus (summary).
- Example: With both fever (recent) and high lactate (older) visible, the actor orders sepsis-related labs and flags sepsis risk instead of chasing an unrelated cause.
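A sketch of how the actor's input could be assembled from the two lanes: the untouched raw trail for precision plus the latest summary for focus. Field labels and formatting are placeholders.

```python
def build_actor_input(instruction, history, summary, experience=""):
    """Assemble the actor prompt: raw trail (precision) + summary (focus) + tips."""
    raw_trail = "\n".join(
        f"Step {i}: {turn['action']} -> {turn['observation']}"
        for i, turn in enumerate(history, start=1)
    )
    return (
        f"Task: {instruction}\n"
        f"Retrieved experience (how to think):\n{experience}\n\n"
        f"Current summary (focus):\n{summary}\n\n"
        f"Full raw history (precision):\n{raw_trail}\n\n"
        "Decide the next tool call, or finish with candidate-aligned answers."
    )
```

Nothing is deleted: the summary is additional context, not a substitute for the raw trail.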
- Candidate alignment before finishing
- What happens: Before producing final predictions, the agent maps free-form ideas to official candidate sets (e.g., CCS for diagnoses, ATC for meds) using robust keyword, fuzzy, and semantic matching. A simplified matching sketch follows this step.
- Why it exists: Outputs must be valid, standardized entities to be scored and to be clinically usable.
- Example: "Kidney failure" is mapped to the correct CCS label, not a stray synonym.
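A simplified sketch of the fuzzy part of candidate alignment using Python's standard difflib; the real toolbox combines keyword, fuzzy, and semantic matching, and the candidate strings and the loose cutoff here are illustrative.

```python
import difflib

def align_candidates(free_text_answers, candidates, cutoff=0.4):
    """Map free-form answers onto the closest official candidate labels."""
    lowered = {c.lower(): c for c in candidates}
    aligned = []
    for answer in free_text_answers:
        match = difflib.get_close_matches(answer.lower(), list(lowered), n=1, cutoff=cutoff)
        if match:
            aligned.append(lowered[match[0]])   # return the official label, not the free text
    return aligned

# Example: a free-form answer lands on the standardized candidate string.
print(align_candidates(["septicemia"], ["Septicemia (except in labor)", "Acute renal failure"]))
# -> ['Septicemia (except in labor)']
```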
- Experience generation after training cases
- What happens: For each training trajectory, the system compares predictions to ground truth and asks, "What was a good or bad reasoning move?" It extracts two kinds of experience: actor strategies (e.g., prioritize candidate grounding over more scraping) and summarizer rules (e.g., keep exact abnormal values, link meds to lab changes with timestamps). A reflection sketch follows this step.
- Why it exists: EHR success is as much about process as about facts; reusable process tips compound over time.
- Example: A rule like "When lab and symptom conflict, order confirmatory tests before finalizing" gets stored.
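A hedged sketch of that reflection step: `llm` is any text-completion callable, and the prompt wording paraphrases the idea rather than reproducing the paper's exact template.

```python
def reflect_on_case(llm, trajectory, predictions, ground_truth):
    """Post-hoc reflection: turn one finished case into reusable 'how-to' lessons."""
    prompt = (
        f"Predicted: {sorted(predictions)}\n"
        f"Ground truth: {sorted(ground_truth)}\n"
        f"Trajectory (actions and observations):\n{trajectory}\n\n"
        "Write two short, reusable lessons:\n"
        "1) Actor strategy: which tool-use or noise-filtering moves helped or hurt?\n"
        "2) Summarizer rule: which details (exact values, timestamps, drug-lab links) "
        "must future summaries preserve?"
    )
    return llm(prompt)   # free-text lessons stored as actor/summarizer experience
```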
- Build the memory bank and retrieve at inference
- What happens: Each experience is stored with an embedding of recent clinical context. At test time, the agent embeds the new case, retrieves the most similar memory, and conditions both summarizer and actor on it. A retrieval sketch follows this step.
- Why it exists: Similar cases often share good playbooks; retrieving the right one boosts stability and speed.
- Example: For a hemodynamic instability case, retrieved experience says "check triage vitals and lactate early" to avoid late sepsis recognition.
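A minimal sketch of such a memory bank using cosine similarity over context embeddings; `embed` is any sentence-embedding callable returning a 1-D vector, and the storage layout is an assumption.

```python
import numpy as np

class ExperienceBank:
    """Store (embedding, lesson) pairs; fetch the nearest lesson by cosine similarity."""

    def __init__(self, embed):
        self.embed = embed
        self.vectors = []
        self.lessons = []

    def add(self, case_context: str, lesson: str) -> None:
        self.vectors.append(np.asarray(self.embed(case_context), dtype=float))
        self.lessons.append(lesson)

    def retrieve(self, case_context: str) -> str:
        if not self.lessons:
            return ""
        query = np.asarray(self.embed(case_context), dtype=float)
        matrix = np.stack(self.vectors)
        sims = matrix @ query / (
            np.linalg.norm(matrix, axis=1) * np.linalg.norm(query) + 1e-8
        )
        return self.lessons[int(np.argmax(sims))]   # best-fit "how-to" lesson
```

At inference, the retrieved lesson is prepended to both the actor's and the summarizer's prompts.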
The secret sauce:
- Two-lane context: keep the full raw story plus an evolving, retrospective summary that learns from new evidence.
- Experience that teaches how to think: not medical trivia, but robust decision habits (what tools to use when, what details to preserve, how to handle contradictions).
- Synergy: Clean summaries make retrieved experiences more useful; good experiences make summaries sharper and actions more efficient.
Concrete mini-walkthrough (sepsis-flavored):
- Input: âWhat are the likely diagnoses and next labs?â
- Turns 1-9: Agent checks vitals, sees fever; looks at labevents, misses lactate so far.
- Turn 10 (retrospective): Summary re-reads distant labs, now surfaces an old high lactate.
- Turns 11-15: Actor, seeing raw history + updated summary, orders blood cultures and lactate recheck; aligns candidates to infection/sepsis labels.
- Finish: Outputs plausible diagnoses (e.g., sepsis), recommended labs (lactate, blood cultures), and maps to official candidates.
- Afterward (training only): Stores experience: "Always link fever with lactate trend; keep exact values and timestamps."
04 Experiments & Results
The test: AgentEHR spans six clinical tasks (Diagnoses, Labevents, Microbiology, Prescriptions, Procedures, and Transfers) covering a patient's journey. Agents must act inside real EHR subsets: MIMIC-IV Common (in-distribution), MIMIC-IV Rare (label-shift OOD), and MIMIC-III (system-shift OOD with different schema/density). Metrics include F1 for final predictions and detailed error taxonomies (e.g., tool loops, no-candidate, parsing failures).
The competition: Baselines include ReAct, Reflexion, ReSum (unidirectional summarization), ReflecTool (two variants), and ReasoningBank (evolving memory). Backbones range from open Qwen3 models (30B/80B/235B) to GPT-5-mini and Grok-4.1-fast.
The scoreboard with context:
- Core win: RETROSUM consistently beats baselines across models and tasks. On Grok-4.1-fast, RETROSUM (evolved) reaches about 0.288 average F1, a clear improvement over ReAct and a strong lift over ReSum, which underperforms with strong backbones.
- Size-robust: Even without the evolving memory, RETROSUM outperforms ReSum on smaller Qwen3-30B and larger GPT-5-mini. This shows the retrospective mechanism's generality.
- Big picture translation: Gains up to 29.16% are like turning a class average from mid B- to solid A work. Error reductions up to 92.3% mean far fewer dead-ends and format mistakes.
Surprising findings:
- Strong models can be hurt by unidirectional summaries (ReSum). When the base model is already a good reasoner, throwing away raw history removes exactly the nuance it uses to shine. RETROSUM avoids that trap by keeping the full history.
- Evolving memories help only when summaries are trustworthy. ReasoningBank and ReflecTool show mixed results, but when paired with RETROSUMâs clean retrospective context, experience retrieval becomes reliably beneficial.
Generalization (OOD):
- Label-shift (MIMIC-IV Rare): RETROSUM leads on rare diagnoses, indicating it relies on medical logic rather than memorized frequencies.
- System-shift (MIMIC-III): While ReSum drops due to schema brittleness, RETROSUM stays strong. Retrospective linking of medical cause-effect transfers across databases better than rigid, one-way compression.
Ablations and knobs:
- Retrospective on actor only helps at very frequent summarization intervals (protects reasoning continuity).
- Retrospective on summarizer only shines at longer intervals (guards distant facts).
- Together they're best (the full RETROSUM), and adding evolving experience delivers the top overall scores.
Efficiency and scaling:
- Turns: ReSum often hits the 100-turn cap; RETROSUM concentrates around 20-40 turns, escaping redundant loops.
- Tokens and time: Despite periodic look-backs, total input tokens drop sharply (about 4.9× fewer than ReAct) and latency improves versus ReAct/ReSum by preventing wasted digging.
- Test-time scaling: Best@K curves rise steadily; RETROSUM outperforms baselines from K=1 up, showing both robust first-try quality and headroom when sampling more.
- Context pressure: When max tokens shrink from 64k to 8k, baselines sag, but RETROSUM holds up because it rescues crucial links via retrospective passes.
05 Discussion & Limitations
Limitations (specific):
- Data scope: Results rely on MIMIC-IV and MIMIC-III (single-center lineage). Policy differences and global demographics may differ.
- Modality gap: The current agent works on structured tables and text; imaging pixels and waveform streams are outside its scope (it relies on text reports instead).
- Experience quality: Memory retrieval depends on good embeddings and reflections; poor matches could import irrelevant habits.
- Hyperparameters: The summarization window w may need tuning per task/model.
- Governance: Not yet deployed in live care; safety, auditability, and bias checks are required.
Required resources:
- An LLM with a sizable context window and tool-use capability.
- The MCP toolbox with record, candidate, schema, and knowledge tools.
- Access to curated per-patient SQLite EHRs and candidate tables.
- GPU/accelerator for longer-context runs; an embedding model and a simple vector store for experiences.
When not to use:
- Real-time ICU streaming where second-by-second waveform analysis is needed.
- Pure imaging workflows without text interpretations.
- Ultra-low-latency settings where even short retrospection is too slow.
- Simple fact lookups where a direct query tool suffices.
Open questions:
- Safety and guardrails: How to detect and block unsafe plans in long chains?
- Fairness and drift: Does performance hold across populations, seasons, and changing hospital policies?
- Multimodal fusion: How to integrate imaging and waveforms without losing retrospective benefits?
- Online learning: Can experiences update continuously with guarantees against catastrophic forgetting?
- Human-AI teaming: What's the best way for clinicians to see, edit, and approve the agent's evolving summaries and plans?
06 Conclusion & Future Work
Three-sentence summary: AgentEHR is a realistic benchmark that forces AI agents to make end-to-end clinical decisions inside raw, noisy EHRs. RETROSUM solves long-context fractures by keeping the full history and periodically re-summarizing it in light of new clues, then guiding actions with both the raw trail and an updated map while reusing learned strategies from similar past cases. Across multiple models, tasks, and datasets, it significantly boosts accuracy, slashes errors, and runs more efficiently.
Main achievement: Proving that retrospective, two-lane context (raw + refreshed summary) plus experience retrieval is the key to stable, strong clinical decision-making in complex EHR environments.
Future directions:
- Multimodal expansion to imaging and waveforms with retrospective fusion.
- Multi-center validation and domain adaptation to varied schemas and policies.
- Stronger safety layers, calibration, and clinician-in-the-loop controls.
- Richer experience indexing (e.g., temporal graphs) and multi-shot retrieval.
Why remember this: It replaces "compress-and-hope" with "look-back-and-link," showing that in medicine, re-reading the story with each new clue (and learning from past stories) turns AI agents from searchers into thoughtful clinical partners.
Practical Applications
- Early sepsis detection by linking fever, lactate trends, and culture orders across time.
- Safer prescribing through timely recognition of drug-lab conflicts (e.g., ACE inhibitors and rising creatinine).
- Smarter lab ordering that prioritizes confirmatory tests and avoids redundant draws.
- Antibiotic stewardship by matching likely infections to appropriate microbiology tests and drug classes.
- Care transfer recommendations (e.g., ICU vs step-down) based on evolving vitals, labs, and supports.
- Medication reconciliation that cross-checks prior histories with current conditions to prevent interactions.
- Procedure planning that aligns diagnoses with appropriate CCS procedures and timing.
- Resident and student training with traceable reasoning chains and experience-based tips.
- Clinical operations auditing to spot and reduce agent tool loops and no-candidate errors.
- EHR system design feedback by highlighting schemas and tools that best support decision continuity.