
Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

Intermediate
Yifei Shen, Yilun Zhao, Justice Ou et al. · 1/14/2026
arXiv · PDF

Key Summary

  • This paper introduces CLINSQL, a 633-task benchmark that turns real clinician-style questions into SQL challenges over the MIMIC-IV v3.1 hospital database.
  • Unlike older datasets, CLINSQL requires building patient-similarity cohorts, joining many tables, and reasoning about time windows like the first 24/48/72 hours.
  • Each task comes with expert-written gold SQL and a detailed rubric that checks both the query logic and the executed results using a critical-first scoring tree.
  • Twenty-two models were tested; GPT-5-mini led overall execution at 74.7%, while DeepSeek-R1 topped open-source models at about 69%.
  • Performance drops sharply on harder questions (for example, Gemini-2.5-Pro falls from 85.5% on Easy to 67.2% on Hard), showing that clinical reasoning is still tough for AI.
  • Most mistakes came from drifting away from the intended patient cohort, mixing up output columns or formats, and getting clinical aggregations wrong.
  • Chain-of-Thought prompting usually helps, and adding schema hints (like correct ICD filters and expected output columns) boosts accuracy further, especially on Medium/Hard tasks.
  • Execution often scores higher than pure SQL logic because different-but-equivalent SQL can still produce acceptable results under the rubric.
  • CLINSQL highlights what real, reliable clinical text-to-SQL will need: precise coding, tight time logic, and trustworthy cohort reasoning.
  • Limitations include a single health system (MIMIC-IV) and heavy expert effort to build and validate the dataset and rubrics.

Why This Research Matters

Hospital questions aren’t just trivia; they guide quality improvement, research, and resource planning. If an AI can reliably turn clinician questions into correct SQL over EHRs, teams can analyze large patient groups quickly and safely. CLINSQL shows where today’s models still stumble—especially on matching precise cohorts and time windows—so developers can fix the right problems. The rubric-based scoring makes success measurable and trustworthy, not just lucky execution. Better clinical text-to-SQL means faster insights about treatments, risks, and outcomes. That can help patients get care informed by evidence from many similar cases, not just one. Over time, these improvements can support safer, fairer, and more efficient healthcare.

Detailed Explanation


01Background & Problem Definition

To make the complex feel simple, let’s first set the stage and introduce key ideas with mini 'sandwich' explainers.

🍞 You know how libraries keep records of many books and people who borrow them? 🥬 What it is: An Electronic Health Record (EHR) is a giant digital library of patient information like admissions, lab tests, medicines, and procedures. How it works (step by step): 1) Hospitals store patient visits in different tables; 2) Each table has columns such as age, lab value, or code; 3) We connect rows across tables using shared IDs like subject_id or hadm_id; 4) Analysts write database queries to find patterns. Why it matters: Without EHRs, we couldn’t search large groups of patient histories to answer clinical questions. 🍞 Example: Finding all ICU stays for women aged 70–80 in a hospital system.
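To make the shared-ID idea concrete, here is a minimal BigQuery-style sketch of the example above. It assumes MIMIC-IV-style table and column names (icustays, patients, subject_id, anchor_age) are available in your dataset; adapt the names to your own EHR schema.

```sql
-- A minimal sketch, assuming MIMIC-IV-style tables; anchor_age is
-- MIMIC-IV's de-identified age column.
SELECT icu.stay_id, icu.hadm_id, icu.intime, icu.outtime
FROM icustays AS icu
JOIN patients AS p
  ON icu.subject_id = p.subject_id      -- the shared ID that links the two tables
WHERE p.gender = 'F'
  AND p.anchor_age BETWEEN 70 AND 80;   -- women aged 70-80
```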

🍞 Imagine asking a librarian, 'How many mystery books were borrowed last month?' 🥬 What it is: SQL is a language that asks databases questions. How it works: 1) SELECT which columns you want; 2) FROM which tables; 3) JOIN tables that relate; 4) WHERE filters like age or diagnosis; 5) GROUP BY to summarize; 6) ORDER/LIMIT to shape results. Why it matters: Without SQL, your question can’t be turned into a precise, repeatable recipe the computer can run. 🍞 Example: 'SELECT AVG(length_of_stay) FROM admissions WHERE gender = "F"'.
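Here is a slightly fuller sketch that exercises every clause in the list above, using the same simplified admissions schema as the example (a length_of_stay column and a gender column are assumed for illustration).

```sql
-- Sketch over the simplified schema from the example above; the column names
-- (admission_type, gender, length_of_stay) are illustrative.
SELECT admission_type,                      -- SELECT: the columns you want
       COUNT(*)            AS n_admissions,
       AVG(length_of_stay) AS avg_los_days
FROM admissions                             -- FROM: the table that holds them
WHERE gender = 'F'                          -- WHERE: filters like gender
GROUP BY admission_type                     -- GROUP BY: summarize per group
ORDER BY avg_los_days DESC                  -- ORDER BY / LIMIT: shape the output
LIMIT 5;
```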

🍞 Picture telling a friend, 'Please look up this fact for me,' and they translate your words into a search. 🥬 What it is: Text-to-SQL means an AI takes a natural-language question and writes the SQL query for you. How it works: 1) Read the question; 2) Map words to the right tables/columns; 3) Add correct filters and joins; 4) Output executable SQL; 5) Run it to get a table of answers. Why it matters: Without text-to-SQL, most clinicians can’t easily tap into huge EHR databases. 🍞 Example: 'Among 70–80-year-old women with PE, report 90-day mortality by risk level' becomes a multi-table SQL query.

🍞 Think of forming a sports team with players who have similar skills and ages. 🥬 What it is: Patient-similarity cohort reasoning selects groups of patients who are medically similar (like age range, diagnosis codes, treatments) so their outcomes can be fairly compared. How it works: 1) Define who belongs (e.g., female, 70–80, PE codes); 2) Use the right medical codes (ICD, itemids) to identify conditions/tests; 3) Apply time windows (first 24 hours, 90-day outcomes); 4) Compute metrics (mortality, LOS) per group. Why it matters: Without tight cohort rules, results get biased or meaningless. 🍞 Example: Compare ICU mortality between patients with and without early HFNC within 24 hours.
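A minimal cohort sketch for the 70–80-year-old female PE example, assuming MIMIC-IV-style tables. The pulmonary-embolism code patterns shown (ICD-9 415.1x stored as '4151%', ICD-10 'I26%') are illustrative, not the benchmark's validated filters.

```sql
WITH pe_cohort AS (
  SELECT DISTINCT a.subject_id, a.hadm_id, a.admittime, a.dischtime
  FROM admissions AS a
  JOIN patients      AS p ON p.subject_id = a.subject_id
  JOIN diagnoses_icd AS d ON d.hadm_id    = a.hadm_id
  WHERE p.gender = 'F'
    AND p.anchor_age BETWEEN 70 AND 80                      -- step 1: who belongs
    AND ( (d.icd_version = 9  AND d.icd_code LIKE '4151%')  -- step 2: the right codes
       OR (d.icd_version = 10 AND d.icd_code LIKE 'I26%') )
)
SELECT COUNT(*) AS cohort_size                              -- step 4: a first metric
FROM pe_cohort;
```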

🍞 You know how a story happens over time, not all at once? 🥬 What it is: Temporal windows are time-based slices like first 72 hours or 30-day readmission. How it works: 1) Pick the clock (admit time, ICU in-time); 2) Define windows (e.g., <= 24 hours); 3) Filter events to that window; 4) Summarize only those events. Why it matters: Without correct time windows, you might count events too early or too late and draw wrong conclusions. 🍞 Example: Only count diagnostic procedures done in the first 72 ICU hours.
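A window sketch for the 72-hour example, assuming MIMIC-IV-style icustays and procedureevents tables with DATETIME time columns (swap in TIMESTAMP_ADD if your columns are TIMESTAMP).

```sql
SELECT icu.stay_id,
       COUNT(pe.orderid) AS procedures_first_72h
FROM icustays AS icu
LEFT JOIN procedureevents AS pe
  ON  pe.stay_id   = icu.stay_id
  AND pe.starttime >= icu.intime                                  -- 1) pick the clock
  AND pe.starttime <  DATETIME_ADD(icu.intime, INTERVAL 72 HOUR)  -- 2) define the window
GROUP BY icu.stay_id;                                             -- 4) summarize only those events
```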

🍞 Think of building a LEGO model using pieces from multiple boxes. 🥬 What it is: Multi-table joins link related tables like admissions, diagnoses, labs, and procedures. How it works: 1) Choose a base table (e.g., admissions); 2) JOIN diagnoses on hadm_id; 3) JOIN patients on subject_id; 4) JOIN labs on hadm_id with time filters; 5) Keep correct keys to avoid dropping rows. Why it matters: Wrong joins lose patients or double-count events. 🍞 Example: Admissions ↔ patients ↔ diagnoses_icd ↔ labevents to compute troponin categories.
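A join-chain sketch for the troponin example, assuming MIMIC-IV-style tables. The LEFT JOIN keeps cohort admissions that have no lab result, and the itemid is a stand-in; look up the exact code in d_labitems.

```sql
SELECT a.hadm_id,
       p.anchor_age,
       ARRAY_AGG(l.valuenum IGNORE NULLS
                 ORDER BY l.charttime LIMIT 1)[SAFE_OFFSET(0)] AS first_troponin
FROM admissions AS a
JOIN patients       AS p ON p.subject_id = a.subject_id     -- patient-level key
JOIN diagnoses_icd  AS d ON d.hadm_id    = a.hadm_id        -- admission-level key
LEFT JOIN labevents AS l ON l.hadm_id    = a.hadm_id
                        AND l.itemid     = 51003            -- assumed troponin itemid
WHERE d.icd_version = 10
  AND d.icd_code LIKE 'I21%'
GROUP BY a.hadm_id, p.anchor_age;
```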

🍞 Imagine barcodes at a store that identify products exactly. 🥬 What it is: Clinical coding systems (like ICD for diagnoses; itemid for ICU charted items) are standardized codes to find the right medical concepts. How it works: 1) Look up diagnosis/procedure/test codes; 2) Filter using code families (e.g., ICD-10 I21% for acute MI); 3) Ensure version-correctness (ICD-9 vs ICD-10); 4) Validate plausible ranges/units. Why it matters: Without precise codes, you grab the wrong patients or tests. 🍞 Example: Identify ACS using ICD-9 410%/411.1 and ICD-10 I21%/I22%.
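The ACS example written as a coding filter. Note that MIMIC-IV stores ICD codes without the decimal point, so ICD-9 411.1 appears as '4111' and ICD-10 I20.0 as 'I200'; verify the exact codes against d_icd_diagnoses in your data.

```sql
SELECT DISTINCT d.hadm_id
FROM diagnoses_icd AS d
WHERE (d.icd_version = 9  AND (d.icd_code LIKE '410%' OR d.icd_code = '4111'))
   OR (d.icd_version = 10 AND (d.icd_code = 'I200'
                            OR d.icd_code LIKE 'I21%'
                            OR d.icd_code LIKE 'I22%'));
```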

The world before this paper: Text-to-SQL benchmarks like WikiSQL, Spider, and BIRD pushed AI to write SQL, but clinical care adds twists: specialized medical terms, careful time windows, and cohort-level comparisons. Prior clinical datasets (MIMICSQL, EHRSQL) showed promise but often centered on single-patient queries or simple stats, not the more realistic 'find similar patients and compare outcomes' workflow clinicians use.

The problem: Real clinical analytics needs the AI to (a) construct correct cohorts, (b) use correct medical codes, (c) respect time windows, and (d) produce clinically sensible result tables. Old benchmarks didn’t fully test those skills.

What people tried before: Generic text-to-SQL training, some clinical datasets, and execution-only scoring. These fell short because models could pass by luck (execution succeeds) without matching critical clinical rules, and they rarely needed full cohort logic.

The gap: A benchmark that forces cohort reasoning with medical codes and time, plus an evaluation that checks both SQL logic and executed result plausibility.

Real stakes: Doctors and hospital analysts use these questions to guide quality, safety, and research. If the cohort is wrong (say, including non-AMI patients), conclusions could mislead care teams or waste resources. CLINSQL exists to measure and improve how reliably AI answers these real questions.

02Core Idea

Here’s the paper’s big idea, then we’ll explain it three ways and break it into parts.

Aha! moment in one sentence: Build a clinically grounded text-to-SQL benchmark (CLINSQL) that requires patient-similarity cohort reasoning and judge it with a rubric that checks both SQL logic and executed outputs using critical-first scoring.

Three analogies:

  1. Cooking show: The recipe (SQL) must use the right ingredients (tables, codes), follow timing (temporal windows), and plate the dish correctly (output columns). The judge doesn’t just taste the dish (execution) but also checks if you followed the recipe steps (SQL rubric).
  2. Science fair: Students (models) must design an experiment (cohort), measure correctly (code mapping, time filters), and present clean charts (result CSV). Judges score must-haves first (critical nodes) before giving partial credit.
  3. Sports tryouts: Players (patients) are selected by strict criteria (age, diagnosis codes). Coaches track performance over specific periods (first 72 hours). A ref (rubric) ensures rules are followed before the scoreboard (execution) counts.

🍞 Imagine a checklist a teacher uses that marks must-do items first. 🥬 What it is: Rubric-based evaluation with 'critical-first' means some checks are essential and must pass before others contribute to the score. How it works: 1) Break evaluation into a tree of checks; 2) Mark critical leaves (e.g., correct ICD filter); 3) If any critical check fails, that branch scores zero; 4) Non-critical items get weighted averages; 5) Apply this to both SQL logic and result CSV. Why it matters: Without it, a model could get good-looking answers for the wrong patients. 🍞 Example: If the SQL forgets to enforce 'female 70–80' as critical, the whole cohort check fails.
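To picture the critical-first rule, here is a toy aggregation over a hypothetical rubric_leaves table (task_id, branch, is_critical, passed, weight). It illustrates the scoring idea only; it is not the paper's judge implementation.

```sql
-- Each row is one rubric check. A branch scores 0 if any critical leaf fails;
-- otherwise it earns the weighted share of passed leaves.
SELECT task_id,
       branch,
       CASE
         WHEN LOGICAL_OR(is_critical AND NOT passed) THEN 0.0
         ELSE SAFE_DIVIDE(SUM(IF(passed, weight, 0)), SUM(weight))
       END AS branch_score
FROM rubric_leaves
GROUP BY task_id, branch;
```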

Before vs after: Before CLINSQL, models could do okay on simpler questions and still look strong on execution-only metrics. After CLINSQL, we see whether they can truly handle clinical-grade tasks—defining precise cohorts, handling time windows, mapping medical codes, and formatting outputs correctly.

Why it works (intuition, no equations):

  • Force the model to do the right steps in the right order (cohort → codes → joins → timing → analytics) and verify them separately.
  • Allow valid alternative SQLs but insist on clinically necessary parts with critical checks.
  • Check both the 'how' (SQL) and 'what' (results) to catch silent logic mistakes.

Building blocks:

  • Six realistic clinical scenario families (admissions/demographics, vitals, labs, medications, procedures, diagnoses & outcomes).
  • Expert-authored gold SQL for each question with correct joins, codes, time logic, and analytics.
  • Two rubric trees per task: one for SQL construction, one for executed results (including plausibility ranges).
  • Critical-first, weighted, and sequential scoring to mirror real clinical must-haves.
  • Long-context prompts and Chain-of-Thought with self-refinement to help models reason step-by-step.
  • A schema-hinted mode that highlights validated ICD filters and expected output columns, reducing common errors like cohort drift.

🍞 Think of a museum guide giving you the exact path to see the highlights first. 🥬 What it is: Schema-hinted inference gives models gentle guardrails—validated code filters and required column names. How it works: 1) Extract ICD/code patterns and expected columns from references; 2) Add them to the prompt; 3) The model aligns filters and SELECT aliases accordingly. Why it matters: Without hints, models often loosen filters or rename outputs incorrectly; hints keep them on track. 🍞 Example: Reminding the model 'Use ICD-10 I21% for AMI' and 'Columns must be [readmission_rate_30_day_pct, ...]'.
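Here is a sketch of what hint-aligned output can look like, with two hypothetical hints shown as comments; the column names and ICD patterns are illustrative, not the benchmark's required schema.

```sql
-- Hypothetical hints:
--   1) identify AMI with ICD-9 '410%' or ICD-10 'I21%'
--   2) output columns must be [ami_admission_count, in_hospital_mortality_pct]
WITH ami AS (
  SELECT DISTINCT a.hadm_id, a.hospital_expire_flag
  FROM admissions AS a
  JOIN diagnoses_icd AS d ON d.hadm_id = a.hadm_id
  WHERE (d.icd_version = 9  AND d.icd_code LIKE '410%')
     OR (d.icd_version = 10 AND d.icd_code LIKE 'I21%')
)
SELECT COUNT(*)                                  AS ami_admission_count,
       ROUND(100 * AVG(hospital_expire_flag), 1) AS in_hospital_mortality_pct
FROM ami;
```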

03Methodology

At a high level: Natural-language question → Cohort definition + code mapping → Multi-table joins → Temporal filters → Analytics/aggregations → Clean CSV output → Rubric-based scoring (SQL + results).

Each step, like a recipe (an end-to-end SQL sketch follows the list):

  1. Understand the question
  • What happens: Parse demographic constraints (age, gender, insurance), condition codes (ICD), time windows (first 24h), and requested metrics (rates, medians, IQRs).
  • Why it exists: If you misread the ask, you’ll compute the wrong thing even with perfect SQL.
  • Example: 'Among females 76–86 with principal AMI transferred from another hospital, report 30-day readmission rate and LOS stats.'
  2. Patient-similarity cohort construction
  • What happens: Apply demographic filters, admission attributes, and diagnosis coding to pick who’s in.
  • Why it exists: Wrong cohorts produce misleading outcomes.
  • Example: Female, 76–86, Medicare, transfer from outside hospital, principal AMI (ICD-9 410% or ICD-10 I21%).
  3. Medical concept implementation (codes and items)
  • What happens: Choose correct diagnosis codes (ICD-9 vs ICD-10), lab itemids, procedure event types, and unit/range checks.
  • Why it exists: Codes disambiguate concepts precisely; mixing versions or labels leads to drift.
  • Example: ACS defined via ICD-9 410%/411.1 and ICD-10 I20.0/I21%/I22%.
  4. Database integration (multi-table joins)
  • What happens: Correctly join patients ↔ admissions ↔ ICU stays ↔ labs/procedures with appropriate keys (subject_id, hadm_id, stay_id). Use LEFT JOINs to avoid losing cohort rows when appropriate.
  • Why it exists: Joins are where many errors occur (row loss, duplication).
  • Example: Join acs_cohort to labevents on hadm_id to fetch the first hs-TnT.
  5. Temporal reasoning
  • What happens: Limit events to windows (e.g., first 24h values or 72h procedures) using charttime/intime/admittime and consistent BigQuery time functions.
  • Why it exists: Clinical meaning often depends on when something happens (early vs late).
  • Example: Count diagnostic procedures between ICU intime and intime + 72 hours.
  6. Clinical analytics (grouping and metrics)
  • What happens: Compute rates (mortality%), medians (APPROX_QUANTILES with offset), IQRs, stratifications (quartiles, quintiles), and LOS, with careful denominators and casting.
  • Why it exists: Mis-specified aggregations lead to nonsense stats.
  • Example: For PE cohort, compute 90-day mortality per risk quintile and median survivor LOS.
  7. Output formatting
  • What happens: Produce a CSV with exact required column names, types, rounding, and no NULLs in answer columns.
  • Why it exists: Downstream tools and evaluators depend on exact schemas.
  • Example: Columns like readmission_rate_30_day_pct, median_los_readmitted_days, ...
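Putting steps 2–7 together, here is an end-to-end sketch for the AMI readmission example, assuming MIMIC-IV-style tables and DATETIME time columns (swap in TIMESTAMP_* functions if yours are TIMESTAMP). The filters, labels, and rounding are illustrative, not the benchmark's gold SQL.

```sql
WITH cohort AS (                                              -- steps 2-3: cohort + codes
  SELECT DISTINCT a.subject_id, a.hadm_id, a.admittime, a.dischtime
  FROM admissions AS a
  JOIN patients      AS p ON p.subject_id = a.subject_id
  JOIN diagnoses_icd AS d ON d.hadm_id    = a.hadm_id
  WHERE p.gender = 'F'
    AND p.anchor_age BETWEEN 76 AND 86
    AND a.insurance = 'Medicare'
    AND a.admission_location = 'TRANSFER FROM HOSPITAL'       -- assumed label; check your data
    AND d.seq_num = 1                                         -- principal diagnosis
    AND ( (d.icd_version = 9  AND d.icd_code LIKE '410%')
       OR (d.icd_version = 10 AND d.icd_code LIKE 'I21%') )
),
flagged AS (                                                  -- steps 4-5: joins + 30-day window
  SELECT c.hadm_id,
         DATETIME_DIFF(c.dischtime, c.admittime, HOUR) / 24.0 AS los_days,
         LOGICAL_OR(nxt.hadm_id IS NOT NULL)                  AS readmitted_30d
  FROM cohort AS c
  LEFT JOIN admissions AS nxt
    ON  nxt.subject_id = c.subject_id
    AND nxt.admittime  > c.dischtime
    AND nxt.admittime <= DATETIME_ADD(c.dischtime, INTERVAL 30 DAY)
  GROUP BY c.hadm_id, c.admittime, c.dischtime
)
SELECT                                                        -- steps 6-7: analytics + output
  ROUND(100 * AVG(IF(readmitted_30d, 1, 0)), 1)               AS readmission_rate_30_day_pct,
  ROUND(APPROX_QUANTILES(IF(readmitted_30d, los_days, NULL), 2)[SAFE_OFFSET(1)], 1)
                                                              AS median_los_readmitted_days
FROM flagged;
```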

Secret sauce: Rubric-based evaluation with critical-first scoring

  • SQL rubric tree verifies cohort, codes, joins, timing, and analytics in a structured way. Critical leaves (like correct ICD filter) must pass.
  • Results rubric tree checks the executed CSV: column presence/order, non-NULLs, types, and plausibility ranges (acceptable vs plausible bands).
  • A judge LLM compares candidate SQL/CSV with rubrics and gold references, recording leaf-level rationales and aggregating scores with a critical-first rule.

Prompts and refinement:

  • Chain-of-Thought: Models reason step-by-step, then output BigQuery SQL.
  • Self-refinement: If execution fails, feed back the error and ask for a minimal fix (up to two rounds).
  • Direct-output mode: Output only the SQL block. CoT generally outperforms direct output.

Schema-hinted mode:

  • Augment CoT prompts with validated ICD filters and expected final columns.
  • Reduces cohort drift and column mismatch, particularly on Medium/Hard tasks.

Common pitfalls (and what breaks without each step):

  • Skip proper codes → cohort drift (wrong patients).
  • Wrong joins → lost or duplicated admissions.
  • Mixed time types (TIMESTAMP vs DATETIME) → execution errors (see the casting snippet after this list).
  • No temporal window → irrelevant events counted.
  • Wrong denominators/aliases → misreported rates or failed schema checks.
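For the time-type pitfall, one defensive pattern is to cast both sides to a single type before comparing, as in this small sketch (column names follow MIMIC-IV conventions but are illustrative).

```sql
SELECT l.hadm_id, l.charttime
FROM labevents AS l
JOIN icustays  AS icu ON icu.hadm_id = l.hadm_id
WHERE CAST(l.charttime AS DATETIME)
      BETWEEN CAST(icu.intime AS DATETIME)
          AND DATETIME_ADD(CAST(icu.intime AS DATETIME), INTERVAL 24 HOUR);
```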

🍞 Imagine getting a report card that checks both your work and your final answer. 🥬 What it is: SQL score vs Execution score. How it works: SQL score checks the construction logic via the rubric; Execution score checks the run CSV and its plausibility. Why it matters: Two views catch more errors: you can run successfully but still fail critical clinical rules, or vice versa. 🍞 Example: Two different SQLs both compute correct 30-day readmission, but only one cleanly meets all rubric checks for cohort and outputs.

04Experiments & Results

The test: Measure how well models translate clinical questions into BigQuery SQL that both (a) follows critical clinical rules (SQL score) and (b) executes to a correct, plausible result CSV (Execution score). Why? Because clinical reliability requires both correct logic and correct outputs.

The competition: 22 models, including proprietary models (GPT-5 series, Gemini-2.5 series, GPT-4.1, o4-mini, Grok-4 Fast variants, Mistral-Medium), open-source general LLMs (DeepSeek-R1, DeepSeek-V3.1, Qwen3 series, Llama-4 variants, Baichuan-M2-32B), and specialized models (SQLCoder-7B-2, MedGemma-27B).

Scoreboard (context added):

  • GPT-5-mini leads average test Execution at 74.7% (like scoring a solid A when others average B or C). Even so, its Hard-split score stays below 70% (69.7%), showing room to grow.
  • Gemini-2.5-Pro drops from 85.5% on Easy to 67.2% on Hard (a big challenge jump).
  • DeepSeek-R1 leads open-source models with about 69% Execution, beating some proprietary baselines but still trailing the best proprietary model by roughly 5.5 points.
  • Execution often outruns SQL score, meaning multiple SQLs can yield acceptable results even if some construction details differ.

Surprising (and helpful) findings:

  • Chain-of-Thought generally improves Execution over direct output. Gains vary by model: some get small bumps; others see meaningful jumps.
  • Schema-hinted CoT (with validated ICD filters and expected column names) boosts both SQL and Execution, especially on Medium/Hard, where constraints matter most.

Error patterns (why models fail):

  • Cohort specification & coding (about 54%): Models relax ICD/item constraints or use keyword heuristics, broadening the cohort incorrectly.
  • Output schema & formatting (~24%): Missing or misnamed columns, nulls in answer fields, or type/rounding mismatches.
  • Aggregation & clinical statistics (~14%): Wrong denominators, using quartiles instead of requested percentiles, or mixing time types.
  • Temporal boundary mistakes also appear (e.g., counting events outside the first 72 hours).

What changes because of this benchmark?

  • We can now quantify clinical-grade text-to-SQL reliability, not just generic database skill.
  • The gap between easy and hard tasks becomes visible and measurable, guiding targeted improvements (e.g., better code mapping and time logic).
  • Simple prompting tweaks (CoT, schema hints) produce measurable, repeatable gains, giving practitioners practical levers to improve systems.

05Discussion & Limitations

Limitations (clear-eyed view):

  • Single health system and one SQL engine: CLINSQL uses MIMIC-IV v3.1 and targets BigQuery SQL. Results may not fully generalize to other hospitals, schemas (OMOP/Cerner/Epic), or SQL dialects.
  • Expert-heavy curation: High-quality gold SQL, scenario crafting, and rubric rationales require clinical and data expertise, which is slower and costlier to scale.
  • Long-context reliance: Some models need long prompts for rubrics and schema; memory limits or truncation can hurt performance.
  • Evaluation via an LLM-as-judge: While guided by concrete rubrics and execution checks, automated judging can still have edge cases.

Resources needed:

  • Access to MIMIC-IV v3.1 (or a comparable EHR), a BigQuery environment, and the CLINSQL dataset/rubrics.
  • Models with sufficient context windows and reliable execution/refinement workflows.

When not to use:

  • If your database schema or SQL dialect is very different and you cannot adapt the mappings.
  • If you need direct patient-level decisions in real time; CLINSQL focuses on analytics cohorts and research-like queries, not bedside alerts.

Open questions:

  • How to generalize across health systems and schemas with minimal re-annotation?
  • Can we automate some expert steps (e.g., code extraction, rubric drafting) without losing clinical fidelity?
  • What training or fine-tuning best reduces cohort drift and time-window mistakes?
  • How to align models with regulatory and audit needs (traceability of cohort and code logic)?

06Conclusion & Future Work

Three-sentence summary: CLINSQL is a new benchmark that makes AI tackle real clinical text-to-SQL by requiring patient-similarity cohorts, temporal logic, and correct medical codes over the MIMIC-IV database. It grades models with rubric trees that check both SQL logic and the executed results using a critical-first strategy. Across 22 models, even the best systems still struggle on the hardest tasks, revealing what’s needed for clinically reliable analytics.

Main achievement: Turning vague, execution-only success into clinically grounded reliability checks that demand correct cohorts, time windows, and outputs—then showing, with data, how current models measure up.

Future directions:

  • Port CLINSQL-style scenarios and rubrics to other EHR systems and SQL dialects; explore schema-agnostic code mapping.
  • Develop training methods (e.g., rubric-aware fine-tuning, code-aware prompting) that reduce cohort drift and temporal errors.
  • Expand rubric coverage for more clinical specialties and multimodal signals (notes, imaging).

Why remember this: CLINSQL reframes text-to-SQL for healthcare’s real needs—cohorts, time, and codes—and gives a trustworthy yardstick to track progress toward safe, useful EHR analytics.

Practical Applications

  • Evaluate hospital analytics assistants by running them on CLINSQL tasks to spot cohort, coding, and timing weaknesses.
  • Use the rubric trees as guardrails in production: block answers that fail critical cohort or coding checks.
  • Adopt schema-hinted prompting (validated ICD filters and expected columns) to boost reliability on harder queries.
  • Fine-tune or instruct-train models with CLINSQL gold SQL and rubric rationales to reduce cohort drift.
  • Build internal dashboards that generate SQL with CoT and auto-validate outputs against results rubrics before publishing.
  • Create a clinical query design checklist (cohort, codes, time, joins, analytics) inspired by CLINSQL’s SQL rubric.
  • Benchmark vendor models fairly during procurement using the same CLINSQL scenarios and scoring.
  • Train data analysts using CLINSQL examples to learn correct joins, time windows, and clinical plausibility checks.
  • Automate error triage: map failed runs to rubric leaves (e.g., ICD mismatch) to direct targeted fixes.
  • Port the framework to local EHR schemas by re-mapping codes and regenerating rubrics for site-specific validation.
#clinical text-to-SQL#EHR#MIMIC-IV#patient similarity#cohort construction#temporal reasoning#ICD codes#rubric-based evaluation#benchmark#Chain-of-Thought#schema hints#SQL execution#clinical analytics#BigQuery#multi-table joins
Version: 1