
Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

Intermediate
Ming Li, Han Chen, Yunze Xiao et al. · 12/21/2025
arXiv · PDF

Key Summary

  • This paper asks a simple question with big impact: Can AI tell which test questions are hard for humans?
  • Across 20+ AI models and four big exam areas, the answer is mostly no—AIs agree with each other more than with real students.
  • Even super-strong models can’t reliably pretend to be weaker students; they still solve lots of questions that stump humans.
  • Using Item Response Theory (IRT), the authors show a 'curse of knowledge': what’s hard for people is often easy for models.
  • Trying personas like 'low-proficiency student' only helps a little and inconsistently; ensemble tricks help a bit but hit a ceiling.
  • Models also lack self-awareness: their difficulty ratings barely predict when they themselves will be wrong (AUROC ≈ 0.55).
  • Bigger models aren’t automatically better aligned; they form a 'machine consensus' that drifts away from human reality.
  • Spearman correlations with real difficulty are weak overall (often below 0.5; as low as 0.13 on USMLE).
  • Takeaway: Solving a problem is different from sensing how hard it feels to a human, and today’s AIs mostly miss that feeling.

Why This Research Matters

If AI can’t feel what’s hard for humans, learning apps can accidentally frustrate or bore students. Teachers and test designers need quick, trustworthy difficulty estimates for new questions, especially when there’s no student data yet. Today’s models often call human-hard items ‘easy,’ so automatic leveling can misfire. This affects homework targeting, exam fairness, and how quickly new curricula can roll out. Fixing this will make adaptive learning more supportive, test prep more efficient, and educational technology more humane. It also pushes AI research toward empathy-like modeling of human limits, not just raw problem-solving. That shift could improve many human-centered AI tools beyond education.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: You know how teachers look at which questions most kids miss to decide what’s hard and what to reteach? That’s because understanding difficulty is the compass for good teaching.

🥬 Filling: What it is: Item Difficulty Prediction (IDP) is the job of figuring out how hard each question is for real students so we can build fair tests, smart study plans, and adaptive practice. How it works (old world):

  1. Give a question to many students.
  2. See who gets it right or wrong.
  3. Use math models (like Item Response Theory) to estimate a difficulty number.

Why it matters: Without good difficulty ratings, tests feel unfair, adaptive apps mislevel students, and teachers can’t scaffold learning well.

🍞 Anchor: Imagine 100 students try a math question and only 20 get it right—that’s a strong signal it’s hard; an app should show it later, not first.
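
A minimal sketch of the classical, data-driven estimate described above, using the simple 'proportion correct' statistic as a stand-in for a full IRT calibration (the response matrix is toy data, not from the paper):

```python
import numpy as np

# Toy response matrix: rows = students, columns = items; 1 = correct, 0 = incorrect.
rng = np.random.default_rng(42)
responses = (rng.random((100, 5)) < [0.9, 0.7, 0.5, 0.3, 0.2]).astype(int)

# Classical difficulty: the share of students who answer correctly (the item "p-value").
# Lower means harder; the anchor above (20 of 100 correct) would give p = 0.20.
p_values = responses.mean(axis=0)

for item, p in enumerate(p_values):
    label = "hard" if p < 0.4 else "medium" if p < 0.7 else "easy"
    print(f"Item {item}: p = {p:.2f} ({label})")
```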

The world before: Schools and testing companies usually needed big field tests to estimate difficulty. That means months of collecting answers, analyzing data, and calibrating numbers before a new question can be trusted. This created a cold start problem: new questions can’t be used in adaptive systems until they’re trialed on lots of students.

Failed attempts: Researchers tried text features (like vocabulary level or sentence length), then small transformer models (like BERT) fine-tuned to predict difficulty from question text. These worked sometimes but needed labeled history from the same domain, and often didn’t transfer well.

The big hope: Large Language Models (LLMs). They’re great at answering questions, so maybe they could also guess how hard a question feels to humans—even without student data. If true, that would solve the cold start problem: write a new question, ask an LLM to rate its difficulty, and deploy.

The real problem: Solving is not sensing. A model can ace a question and still not understand why it’s hard for an average student. Think of a chess master guessing which moves are hard for beginners—that’s tricky.

🍞 Hook: Imagine a strong swimmer judging how scary the deep end feels to a new swimmer. Their strength makes it hard to ‘feel’ the fear.

🥬 Filling: What it is: Human-AI Difficulty Alignment is making sure AI’s difficulty ratings match what actual students experience. How it works:

  1. Ask the model to rate question difficulty (as an observer).
  2. Compare its ratings with ground truth from real test-takers.
  3. Check correlations (does it rank hard vs. easy like humans do?).

Why it matters: If AI misreads difficulty, adaptive apps will frustrate learners; content creators can’t control question levels; and exams might become unfair.

🍞 Anchor: If humans think a passage question is ‘hard’ but AI says ‘easy’, the app may give it too early—leading to confusion.
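
A minimal sketch of that comparison, assuming we already have per-item ratings from the model and human ground-truth difficulties (the numbers below are illustrative, not the paper's data):

```python
from scipy.stats import spearmanr

# Hypothetical difficulty scores for 8 items (higher = harder).
human_difficulty = [0.82, 0.35, 0.60, 0.91, 0.20, 0.55, 0.74, 0.40]  # from field-test IRT
ai_rating        = [0.30, 0.25, 0.65, 0.50, 0.15, 0.70, 0.45, 0.35]  # model's observer ratings

# Spearman compares the ordering of items, so the two scales don't need to match.
rho, p_value = spearmanr(human_difficulty, ai_rating)
print(f"Spearman rho = {rho:.2f}")  # values under ~0.5, as in the paper, signal weak alignment
```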

In practice, the authors evaluated 20+ models across medical exams (USMLE), language (Cambridge), and SAT Reading/Writing and Math. They tested two sides: perception (how models rate difficulty when shown the right answer) and capability (models taking the test without answers). They also tried role-play prompts: ‘act like a low/average/high proficiency student.’

The gap they found is important in daily life: Learning apps suggesting too-tough tasks can demotivate kids; teachers need trustworthy auto-generated practice at the right level; test makers need fair calibration quickly. If models don’t ‘feel’ what humans feel, those tools won’t help the way we expect.

02 Core Idea

🍞 Hook: Imagine two judges of a race course—one is a cheetah and one is a kid learning to run. The cheetah can finish any course, but that doesn’t mean it knows which parts feel hard to the kid.

🥬 Filling: The aha! Models that solve problems very well don’t automatically understand which problems feel hard to humans. How it works:

  1. Measure how models rank question difficulty versus real student data.
  2. Also measure how hard questions are for models themselves (using IRT across many models).
  3. Compare both with human difficulty to spot alignment—or misalignment.

Why it matters: If we assume ‘great solver = great judge of human difficulty,’ our educational tools will misplace content and lose trust.

🍞 Anchor: A top chess engine knows the best move instantly, but that doesn’t tell a coach which ideas a 6th grader finds confusing.

Three analogies:

  1. Lifeguard vs. swimmer: Being able to swim fast (solve) is different from sensing when someone is struggling (difficulty perception).
  2. Mountain goat vs. hiker: A goat scales cliffs easily but can’t tell which path scares a first-time hiker.
  3. Calculator vs. student: A calculator never feels ‘this long division is hard,’ even if a student does.

Before vs. After:

  • Before: Many hoped bigger, smarter LLMs would naturally predict human difficulty well.
  • After: Bigger often means more ‘machine consensus’—models agree with each other more than with humans. High performance can make models worse at guessing what’s hard for people.

Why it works (intuition):

  • Perception needs empathy-like modeling of human limitations, not just knowledge.
  • Models form internal shortcuts and patterns unlike human learning. So the things that trip students (vocabulary, subtle distractors, multi-step mental load) may not slow models.
  • Even when you tell a model to ‘act weaker,’ its strong internal knowledge leaks through—the curse of knowledge.

Building blocks (each explained with a Sandwich):

  • 🍞 Hook: You know how teachers rank questions from easiest to hardest after grading? 🥬 IDP (Item Difficulty Prediction): What it is: Estimating how hard each question is for students. How: 1) Look at question text and/or student results. 2) Predict a difficulty score. 3) Use it to organize lessons/tests. Why: Without it, learning paths and tests feel bumpy and unfair. 🍞 Anchor: A learning app chooses the next math problem by reading its difficulty rating first.

  • 🍞 Hook: A coach judges both player skill and play difficulty from game stats. 🥬 IRT (Item Response Theory): What it is: A way to model the chance a student gets a question right based on student ability and item difficulty. How: 1) Assume each student has an ability number. 2) Each question has a difficulty number. 3) Use a curve to predict right/wrong odds. 4) Fit those numbers from many results. Why: Without IRT, difficulty depends too much on who was in the sample that day. 🍞 Anchor: A very easy item is solved by almost everyone, no matter their ability; a very hard one only by high-ability students.

  • 🍞 Hook: A robot and a kid each pick the ‘hardest’ Lego set—do they pick the same? 🥬 Human-AI Difficulty Alignment: What it is: Checking if AI’s difficulty ratings match humans’. How: 1) Ask AI to rate. 2) Compare with human-based ground truth. 3) Measure ranking similarity (Spearman correlation). Why: If they don’t match, AI guidance won’t fit real learners. 🍞 Anchor: If AI calls many human-hard items ‘easy,’ an app could wrongly rush students.

  • 🍞 Hook: Sometimes a group of friends copy each other’s opinions. 🥬 Machine Consensus: What it is: Models agree with each other more than with humans. How: 1) Compare model-to-model difficulty rankings. 2) See stronger agreement within models than with real student data. Why: Groupthink hides the real human struggle and misleads design. 🍞 Anchor: Five models call a tricky reading question ‘easy,’ but student data says it’s ‘hard.’

  • 🍞 Hook: Acting in a school play as a ‘beginner’ doesn’t make you actually forget math. 🥬 Proficiency Simulation: What it is: Prompting a model to act like a low/medium/high-proficiency student to rate difficulty. How: 1) Give a role (‘weak student’). 2) Have it rate or answer. 3) Optionally average across roles (ensemble). Why: If this worked, models could mimic students to predict difficulty. But it’s inconsistent. 🍞 Anchor: The paper’s personas changed distributions a bit but didn’t reliably fix misalignment.

  • 🍞 Hook: A puzzle champ can’t tell which riddles confuse beginners. 🥬 Capability-Perception Gap: What it is: The mismatch between what a model can solve and what it thinks is hard for humans. How: 1) Measure model accuracy (actor view). 2) Measure its difficulty ratings (observer view). 3) Compare both to human truth. Why: If the gap is big, the model’s predictions won’t guide real learners well. 🍞 Anchor: A model aces SAT Math but can’t sense why an average student struggles on multi-step algebra.

  • 🍞 Hook: A math teacher forgets how tricky fractions felt at first. 🥬 Curse of Knowledge: What it is: Strong models can’t ‘unsee’ answers, so they fail to simulate struggling students. How: 1) Prompt to ‘act weak.’ 2) Accuracy barely drops. 3) Many human-hard items turn out model-easy (high Savant rates). Why: Without overcoming this, difficulty estimates stay optimistic. 🍞 Anchor: On USMLE, over two-thirds of the top-third-hardest human items were solved by at least 90% of models.

  • 🍞 Hook: Before a test, you guess which problems you’ll miss. 🥬 Metacognitive Alignment: What it is: Do a model’s difficulty scores predict when it will be wrong? How: 1) Mark which items the model missed. 2) See if it gave those higher difficulty (AUROC). Why: Without self-awareness, its ‘hard’ label won’t warn us reliably. 🍞 Anchor: Most models score near 0.55 AUROC—barely better than chance.

03 Methodology

At a high level: Input (question text + correct answer for perception, or no answer for solving) → Model either rates difficulty (observer) or answers the question (actor) → We compare to human ground truth using statistics.

Step A: Observer View (Perceived Difficulty)

  • What happens: The model sees the full item (including the correct answer) and outputs a difficulty rating. This isolates ‘judging how hard’ from ‘trying to solve it.’
  • Why this step exists: We want the model’s sense of human struggle, not whether it can solve it.
  • Example: For a Cambridge reading item, the model outputs: ‘Difficulty = 62/100.’ We compare to the real IRT-based difficulty.
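
The paper's exact prompt wording isn't reproduced here; the sketch below only illustrates the shape of an observer-view query (the strings and the `ask_llm` helper are hypothetical placeholders):

```python
def build_observer_prompt(question: str, options: list[str], correct: str) -> str:
    """Hypothetical observer-view prompt: the model sees the answer and only judges difficulty."""
    option_text = "\n".join(options)
    return (
        "You are rating exam items for human test-takers.\n\n"
        f"Question:\n{question}\n\nOptions:\n{option_text}\n\n"
        f"Correct answer: {correct}\n\n"
        "On a scale of 0 (trivial) to 100 (extremely hard for an average test-taker), "
        "how difficult is this item? Reply with a single integer."
    )

# prompt = build_observer_prompt(q_text, q_options, q_answer)
# rating = int(ask_llm(prompt))   # ask_llm stands in for whatever API client you use
```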

Step B: Actor View (Intrinsic Capability)

  • What happens: The model takes the question without the answer, tries to solve it, and we mark correct/incorrect.
  • Why this step exists: To see what actually feels hard to the model itself, so we can compare model-experienced difficulty to human difficulty.
  • Example: On an SAT Math problem, a model gets it right. We store a 1; if wrong, a 0.

Step C: IRT Across Models (Machine Difficulty)

  • What happens: Treat each model like a ‘student’ and fit an IRT Rasch model to estimate each item’s ‘machine difficulty’ from the correctness matrix.
  • Why this step exists: It tells us, independent of one model’s opinion, which items are easy or hard for the model population.
  • Example: A USMLE item that almost all models get right has a low machine difficulty.
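
A minimal sketch of fitting 'machine difficulty' from a models-by-items correctness matrix with a Rasch (1PL) joint maximum-likelihood fit; the optimizer choice and the toy data are assumptions, not details from the paper:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Toy correctness matrix: rows = models treated as "students", columns = items.
rng = np.random.default_rng(0)
X = (rng.random((21, 50)) < 0.7).astype(float)
n_models, n_items = X.shape

def neg_log_likelihood(params):
    theta = params[:n_models]               # model "abilities"
    b = params[n_models:]                   # item "machine difficulties"
    p = expit(theta[:, None] - b[None, :])  # Rasch: P(correct) = sigmoid(ability - difficulty)
    eps = 1e-9
    return -np.sum(X * np.log(p + eps) + (1 - X) * np.log(1 - p + eps))

res = minimize(neg_log_likelihood, np.zeros(n_models + n_items), method="L-BFGS-B")
machine_difficulty = res.x[n_models:]
machine_difficulty -= machine_difficulty.mean()  # scale is only identified up to a shift
# Items nearly every model solves (like the USMLE example above) get low machine difficulty.
```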

Step D: Proficiency Simulation (Personas)

  • What happens: Use prompts to make models ‘act’ like low/medium/high-proficiency students when rating difficulty or answering.
  • Why this step exists: If models can authentically role-play, their ratings might align better with real students.
  • Example: The ‘low-proficiency’ persona might try to rate a reading item as ‘hard’ due to vocabulary and inference steps.
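
The persona wording used in the paper isn't quoted here; the strings below are only illustrative of how a proficiency role might be prepended to the observer prompt from Step A:

```python
# Hypothetical persona prefixes for proficiency simulation (illustrative wording only).
PERSONAS = {
    "low":  "Role-play as a low-proficiency student who often struggles in this subject.",
    "mid":  "Role-play as an average student with typical preparation for this exam.",
    "high": "Role-play as a top student who rarely misses questions in this subject.",
}

def build_persona_prompt(persona: str, observer_prompt: str) -> str:
    """Prepend a proficiency role to the base difficulty-rating prompt."""
    return PERSONAS[persona] + "\n\n" + observer_prompt
```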

Step E: Ensembling Predictions

  • What happens: Average difficulty ratings across top models, or average the same model’s ratings across different personas.
  • Why this step exists: To denoise random variation and see if groups do better than individuals.
  • Example: Average GPT-4.1, QWQ-32B, and DeepSeek-R1 predictions to see if correlation rises.
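
A minimal sketch of the averaging idea, assuming we already collected per-item ratings from several models (or from several personas of one model); all numbers are illustrative:

```python
import numpy as np
from scipy.stats import spearmanr

# Rows = predictors (models or personas), columns = items; hypothetical 0-100 ratings.
ratings = np.array([
    [40, 50, 70, 45, 65],   # e.g. one frontier model
    [35, 60, 80, 25, 55],   # e.g. a strong open model
    [50, 50, 75, 35, 60],   # e.g. a reasoning model
])
human = np.array([45, 70, 85, 20, 75])  # toy ground-truth difficulties

ensemble = ratings.mean(axis=0)         # simple average across predictors
rho_single, _ = spearmanr(human, ratings[0])
rho_ensemble, _ = spearmanr(human, ensemble)
print(f"single rho = {rho_single:.2f}, ensemble rho = {rho_ensemble:.2f}")
# Averaging can smooth individual quirks, but the paper finds it hits a ceiling.
```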

Secret Sauce (why the approach is clever):

  • It separates perception (how hard it seems) from capability (what models can do), then adds a self-check (metacognition). This three-way view makes misalignment visible and measurable.

Mini-Sandwiches for the key tools:

  • 🍞 Hook: You and your friend each line up books from easiest to hardest. Do your orders match? 🥬 Spearman Correlation: What it is: A score that tells how similarly two rankings order items. How: 1) Rank both lists. 2) Compare positions. 3) High score means ‘we agree on the order.’ Why: It focuses on ‘which is harder than which’ even if scales differ. 🍞 Anchor: If both you and the AI say Item A is harder than Item B most of the time, Spearman is high.

  • 🍞 Hook: A coach estimates how likely each player scores on a shot based on skill. 🥬 Rasch Model (IRT 1PL): What it is: A simple IRT model that uses one item parameter (difficulty) and one person parameter (ability). How: 1) Assume each model has an ability number. 2) Each item has a difficulty number. 3) A logistic curve maps their difference to success odds. 4) Fit numbers from the right/wrong data. Why: It cleanly separates ‘who you are’ from ‘how hard this is.’ 🍞 Anchor: A high-ability model has a high chance to solve an average item; a low-ability model struggles on high-difficulty items.

  • 🍞 Hook: Before a quiz, you predict whether you’ll miss a question. 🥬 AUROC (for metacognition): What it is: A number showing how well ‘difficulty scores’ separate the items you’ll miss from those you’ll get right. How: 1) Label missed vs. solved. 2) See how often ‘missed’ items got higher difficulty scores. 3) 0.5 = guesswork; nearer 1 = strong self-awareness. Why: If the model can’t flag its own likely failures, its difficulty sense isn’t grounded. 🍞 Anchor: AUROC ≈ 0.55 means ‘barely better than guessing’ which items it will miss.
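
A minimal sketch of the metacognition check, assuming we have a model's own difficulty ratings plus a record of which items it actually missed (illustrative data only):

```python
from sklearn.metrics import roc_auc_score

# 1 = the model got this item wrong, 0 = it got it right (hypothetical outcomes).
missed          = [1,  0,  1,  0,  0,  1,  0,  0]
# The same model's self-reported difficulty ratings for those items (0-100 scale).
self_difficulty = [45, 60, 40, 35, 50, 55, 42, 30]

# AUROC: how often a missed item received a higher difficulty rating than a solved one.
auroc = roc_auc_score(missed, self_difficulty)
print(f"AUROC = {auroc:.2f}")  # 0.60 here; the paper reports ~0.55 on average, near the 0.5 chance level
```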

Concrete example flow (Cambridge reading item):

  1. Observer: Model reads passage+options+correct answer, rates difficulty 58/100.
  2. Actor: Another run without the answer; model picks C (correct).
  3. Machine IRT: Across 21 models, 90% get it right → machine difficulty is low.
  4. Compare: Humans rated it mid-hard; model’s perception says mid, but machine difficulty says easy. That mismatch flags misalignment.

What breaks without each step:

  • Without Observer: We’d only know what models can do, not what they think humans feel.
  • Without Actor/IRT: We’d miss the curse of knowledge (human-hard but model-easy).
  • Without Personas/Ensembles: We wouldn’t test or reduce variance in ‘acting like students.’
  • Without AUROC: We couldn’t judge self-awareness, so we’d trust shaky difficulty labels.

04 Experiments & Results

The tests and why: The authors measured whether models’ difficulty rankings match human reality (Spearman correlation), whether items are easy/hard for models as a group (IRT), and whether models know when they’ll be wrong (AUROC).

The competition: 20+ models, including frontier proprietary (e.g., GPT-4.1, GPT-5) and strong open models (e.g., QWQ-32B, DeepSeek-R1), across four domains: USMLE (medical), Cambridge (reading proficiency), SAT Reading/Writing, SAT Math.

Scoreboard with context:

  • Perception alignment is weak overall. Average Spearman correlations are often below 0.5. For USMLE it’s as low as ≈ 0.13 (that’s like guessing classmates’ heights and getting the order mostly wrong). SAT Math does better (≈ 0.41 on average) but still far from perfect.
  • Bigger isn’t reliably better. Some frontier models don’t beat strong baselines; models tend to agree with each other more than with humans (machine consensus).
  • Ensembling helps but hits a ceiling. Greedy ensembles raised SAT Math correlation from ≈ 0.56 to ≈ 0.66 before weaker models dragged it down—like adding too much water to good paint.
  • Personas (proficiency simulation) are inconsistent. Single-persona prompts can help or hurt, but averaging across personas often improves stability. For instance, GPT-5’s average correlation rose from ≈ 0.34 to ≈ 0.47 when averaging personas—better, but still far from ideal.

The surprising ‘curse of knowledge’ (IRT findings):

  • Machine IRT difficulty correlates even worse with human difficulty than the models’ own perceptions. That means the items that feel hard to humans are often easy for models at scale.
  • Saturation: A huge fraction of items are solved by ≥90% of models, flattening machine difficulty. USMLE saturation ≈ 75.6% (very high); SAT Math ≈ 54.6%.
  • Savant items (human-top-33%-hard but solved by ≥90% of models): USMLE ≈ 70.4% (wow!), SAT Math ≈ 32.2%, Cambridge ≈ 22.1%, SAT Reading ≈ 25.5%. Translation: what’s truly tough for people often looks trivial to machines.
  • Persona accuracy shifts are tiny (<1% usually). Even when asked to ‘be weak,’ models mostly keep acing many items—proof of the curse of knowledge.
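
A minimal sketch of how saturation and savant rates like those above could be computed from a correctness matrix and human difficulty scores; the thresholds follow the paper's definitions (solved by at least 90% of models, human difficulty in the top third), while the data itself is toy:

```python
import numpy as np

rng = np.random.default_rng(1)
X = (rng.random((21, 200)) < 0.8).astype(int)   # models x items correctness (toy)
human_difficulty = rng.normal(size=200)         # toy human IRT difficulties

model_solve_rate = X.mean(axis=0)               # fraction of models solving each item
saturated = model_solve_rate >= 0.90            # "easy for the machine population"
human_hard = human_difficulty >= np.quantile(human_difficulty, 2 / 3)  # top-third hardest for humans

savant = saturated & human_hard                 # human-hard but machine-easy
print(f"saturation rate: {saturated.mean():.1%}")
print(f"savant rate (share of human-hard items that are machine-easy): "
      f"{savant.sum() / human_hard.sum():.1%}")
```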

Metacognitive blindness (AUROC):

  • Most models hover around 0.50–0.60 AUROC (≈ 0.55 on average). That’s barely better than guessing whether they’ll miss a problem.
  • A few bright spots (e.g., GPT-5 hitting ≈ 0.73 on Cambridge) show it’s possible in pockets, but overall self-awareness remains poor.

Bottom line: Across domains, the same pattern repeats—models do not naturally feel human difficulty. They cluster around a machine view, can’t convincingly pretend to be weaker, and don’t accurately foresee their own mistakes.

05 Discussion & Limitations

Limitations:

  • The study uses zero-shot prompting for personas and does not fine-tune on real student error traces. Perhaps few-shot or fine-tuned approaches could align better with human difficulty.
  • Only four domains were tested (though diverse); other subjects or formats (open-ended responses, diagrams) might change results.
  • The Rasch (1PL) model assumes equal discrimination across items; adding more parameters (2PL/3PL) might refine machine difficulty estimates.

Required resources:

  • Access to multiple LLMs (open and proprietary), compute to run large batches, and datasets with ground-truth human difficulty from field tests (rare but crucial).

When not to use this approach alone:

  • High-stakes test calibration without human data. The misalignment is too large to rely solely on LLMs.
  • Personalized learning paths for novices where over-challenging items can harm motivation.
  • Domains where model super-knowledge (e.g., medical fact retrieval) likely masks true human difficulty.

Open questions:

  • Can fine-tuning with real student traces (errors, partial work) enable authentic proficiency simulation?
  • Can models be trained to ‘forget’ or mask knowledge on command to overcome the curse of knowledge?
  • Can we build explicit cognitive models (e.g., working memory limits, common misconceptions) into LLMs to better mirror human struggle?
  • What uncertainty signals (beyond verbal confidence) best predict human-perceived difficulty?
  • How do multimodal items (graphs, audio) affect alignment and metacognition?

06 Conclusion & Future Work

Three-sentence summary: This paper shows that today’s powerful language models can solve many questions but still don’t feel which ones are hard for humans. They agree with each other more than with real students, can’t convincingly act as weaker learners, and lack self-awareness about their own likely mistakes. So, solving isn’t sensing, and automated difficulty prediction needs more than just bigger brains.

Main achievement: A clean, large-scale, three-lens evaluation (perception, capability via IRT, and metacognition via AUROC) that reveals systematic Human–AI difficulty misalignment and the curse of knowledge.

Future directions: Fine-tune with real student traces and misconceptions; design ‘knowledge masking’ or curriculum-aware training to enable authentic low-proficiency simulation; integrate cognitive constraints into model objectives; develop stronger uncertainty signals tied to failure prediction; and use richer IRT models to capture machine difficulty nuances.

Why remember this: Because education runs on ‘what’s hard for learners,’ and this paper proves that raw AI strength doesn’t equal human-aligned difficulty sense. Until we bridge that empathy-like gap, AI-made tests, tutors, and study plans may be impressively fast—but not truly student-friendly.

Practical Applications

  • Use ensemble-of-personas (low/mid/high) averages as a quick, low-cost stabilizer for LLM difficulty ratings, while recognizing limits.
  • Triaging item banks: flag items that are human-hard but model-easy (savant items) for expert review before deploying to novices.
  • Hybrid calibration: blend small pilot student data with LLM predictions, weighting human data more where misalignment is known (e.g., USMLE-like domains).
  • Metacognition monitoring: track AUROC over time to decide when to rely on or override model difficulty advice.
  • Domain-aware prompting: add explicit cognitive-load cues (multi-step reasoning, distractor strength) to nudge models to notice human struggle factors.
  • Guardrails in adaptive systems: cap how fast difficulty can rise when guided by model predictions to prevent over-challenging learners.
  • Data collection strategy: prioritize field-testing on items that models claim are ‘easy’ but humans might find ‘hard’ based on content features.
  • Teacher-in-the-loop: present AI difficulty plus a short rationale and let educators quickly approve/adjust levels at scale.
  • Future fine-tuning: train models on real student traces (errors, partial solutions) to learn authentic struggle patterns.
  • Benchmark expansion: add multimodal items (graphs, diagrams) to reveal new misalignment patterns and improve calibration.
#Item Difficulty Prediction #Item Response Theory #Rasch Model #Human-AI Alignment #Proficiency Simulation #Machine Consensus #Curse of Knowledge #Metacognition #AUROC #Spearman Correlation #Educational Assessment #Adaptive Testing #Difficulty Calibration #Student Modeling #LLM Evaluation
Version: 1