Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
Key Summary
- This paper asks a simple question with big impact: Can AI tell which test questions are hard for humans?
- Across 20+ AI models and four big exam areas, the answer is mostly no: AIs agree with each other more than with real students.
- Even super-strong models can't reliably pretend to be weaker students; they still solve lots of questions that stump humans.
- Using Item Response Theory (IRT), the authors show a 'curse of knowledge': what's hard for people is often easy for models.
- Trying personas like 'low-proficiency student' only helps a little and inconsistently; ensemble tricks help a bit but hit a ceiling.
- Models also lack self-awareness: their difficulty ratings barely predict when they themselves will be wrong (AUROC ≈ 0.55).
- Bigger models aren't automatically better aligned; they form a 'machine consensus' that drifts away from human reality.
- Spearman correlations with real difficulty are weak overall (often below 0.5; as low as 0.13 on USMLE).
- Takeaway: Solving a problem is different from sensing how hard it feels to a human, and today's AIs mostly miss that feeling.
Why This Research Matters
If AI can't feel what's hard for humans, learning apps can accidentally frustrate or bore students. Teachers and test designers need quick, trustworthy difficulty estimates for new questions, especially when there's no student data yet. Today's models often call human-hard items 'easy,' so automatic leveling can misfire. This affects homework targeting, exam fairness, and how quickly new curricula can roll out. Fixing this will make adaptive learning more supportive, test prep more efficient, and educational technology more humane. It also pushes AI research toward empathy-like modeling of human limits, not just raw problem-solving. That shift could improve many human-centered AI tools beyond education.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how teachers look at which questions most kids miss to decide what's hard and what to reteach? That's because understanding difficulty is the compass for good teaching.
🥬 Filling: What it is. Item Difficulty Prediction (IDP) is the job of figuring out how hard each question is for real students so we can build fair tests, smart study plans, and adaptive practice. How it works (old world):
- Give a question to many students.
- See who gets it right or wrong.
- Use math models (like Item Response Theory) to estimate a difficulty number. Why it matters: Without good difficulty ratings, tests feel unfair, adaptive apps mislevel students, and teachers can't scaffold learning well.
🍞 Anchor: Imagine 100 students try a math question and only 20 get it right; that's a strong signal it's hard, and an app should show it later, not first.
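A minimal sketch of the classical field-test calculation behind that signal, using a toy response matrix; real testing programs refine these proportions with IRT, but the starting point is just "what share of students got it right."

```python
import numpy as np

# Toy field-test data: rows = students, columns = items; 1 = correct, 0 = incorrect.
responses = np.array([[1, 0, 1, 0],
                      [1, 0, 1, 1],
                      [1, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 0, 1, 0]])

# Classical difficulty index (the "p-value"): share of students answering correctly.
# Lower means harder -- 20 of 100 students correct would give 0.20.
p_values = responses.mean(axis=0)
print(p_values)
```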
The world before: Schools and testing companies usually needed big field tests to estimate difficulty. That means months of collecting answers, analyzing data, and calibrating numbers before a new question can be trusted. This created a cold start problem: new questions can't be used in adaptive systems until they're trialed on lots of students.
Failed attempts: Researchers tried text features (like vocabulary level or sentence length), then small transformer models (like BERT) fine-tuned to predict difficulty from question text. These worked sometimes but needed labeled history from the same domain, and often didn't transfer well.
The big hope: Large Language Models (LLMs). They're great at answering questions, so maybe they could also guess how hard a question feels to humans, even without student data. If true, that would solve cold start: write a new question, ask an LLM to rate difficulty, and deploy.
The real problem: Solving is not sensing. A model can ace a question and still not understand why it's hard for an average student. Think of a chess master guessing which moves are hard for beginners; that's tricky.
🍞 Hook: Imagine a strong swimmer judging how scary the deep end feels to a new swimmer. Their strength makes it hard to 'feel' the fear.
🥬 Filling: What it is. Human-AI Difficulty Alignment is making sure AI's difficulty ratings match what actual students experience. How it works:
- Ask the model to rate question difficulty (as an observer).
- Compare its ratings with ground truth from real test-takers.
- Check correlations (does it rank hard vs. easy like humans do?). Why it matters: If AI misreads difficulty, adaptive apps will frustrate learners; content creators can't control question levels; and exams might become unfair.
🍞 Anchor: If humans think a passage question is 'hard' but AI says 'easy', the app may give it too early, leading to confusion.
In practice, the authors evaluated 20+ models across medical exams (USMLE), language (Cambridge), and SAT Reading/Writing and Math. They tested two sides: perception (how models rate difficulty when shown the right answer) and capability (models taking the test without answers). They also tried role-play prompts: 'act like a low/average/high proficiency student.'
The gap they found is important in daily life: Learning apps suggesting too-tough tasks can demotivate kids; teachers need trustworthy auto-generated practice at the right level; test makers need fair calibration quickly. If models don't 'feel' what humans feel, those tools won't help the way we expect.
02 Core Idea
🍞 Hook: Imagine two judges of a race course: one is a cheetah and one is a kid learning to run. The cheetah can finish any course, but that doesn't mean it knows which parts feel hard to the kid.
🥬 Filling: The aha! Models that solve problems very well don't automatically understand which problems feel hard to humans. How it works:
- Measure how models rank question difficulty versus real student data.
- Also measure how hard questions are for models themselves (using IRT across many models).
- Compare both with human difficulty to spot alignment or misalignment. Why it matters: If we assume 'great solver = great judge of human difficulty,' our educational tools will misplace content and lose trust.
🍞 Anchor: A top chess engine knows the best move instantly, but that doesn't tell a coach which ideas a 6th grader finds confusing.
Three analogies:
- Lifeguard vs. swimmer: Being able to swim fast (solve) is different from sensing when someone is struggling (difficulty perception).
- Mountain goat vs. hiker: A goat scales cliffs easily but can't tell which path scares a first-time hiker.
- Calculator vs. student: A calculator never feels 'this long division is hard,' even if a student does.
Before vs. After:
- Before: Many hoped bigger, smarter LLMs would naturally predict human difficulty well.
- After: Bigger often means more 'machine consensus': models agree with each other more than with humans. High performance can make models worse at guessing what's hard for people.
Why it works (intuition):
- Perception needs empathy-like modeling of human limitations, not just knowledge.
- Models form internal shortcuts and patterns unlike human learning. So the things that trip students (vocabulary, subtle distractors, multi-step mental load) may not slow models.
- Even when you tell a model to 'act weaker,' its strong internal knowledge leaks through: the curse of knowledge.
Building blocks (each explained with a Sandwich):
- 🍞 Hook: You know how teachers rank questions from easiest to hardest after grading? 🥬 IDP (Item Difficulty Prediction): What it is: Estimating how hard each question is for students. How: 1) Look at question text and/or student results. 2) Predict a difficulty score. 3) Use it to organize lessons/tests. Why: Without it, learning paths and tests feel bumpy and unfair. 🍞 Anchor: A learning app chooses the next math problem by reading its difficulty rating first.
- 🍞 Hook: A coach judges both player skill and play difficulty from game stats. 🥬 IRT (Item Response Theory): What it is: A way to model the chance a student gets a question right based on student ability and item difficulty. How: 1) Assume each student has an ability number. 2) Each question has a difficulty number. 3) Use a curve to predict right/wrong odds. 4) Fit those numbers from many results. Why: Without IRT, difficulty depends too much on who was in the sample that day. 🍞 Anchor: A very easy item is solved by almost everyone, no matter their ability; a very hard one only by high-ability students.
- 🍞 Hook: A robot and a kid each pick the 'hardest' Lego set; do they pick the same? 🥬 Human-AI Difficulty Alignment: What it is: Checking if AI's difficulty ratings match humans'. How: 1) Ask AI to rate. 2) Compare with human-based ground truth. 3) Measure ranking similarity (Spearman correlation). Why: If they don't match, AI guidance won't fit real learners. 🍞 Anchor: If AI calls many human-hard items 'easy,' an app could wrongly rush students.
- 🍞 Hook: Sometimes a group of friends copy each other's opinions. 🥬 Machine Consensus: What it is: Models agree with each other more than with humans. How: 1) Compare model-to-model difficulty rankings. 2) See stronger agreement within models than with real student data. Why: Groupthink hides the real human struggle and misleads design. 🍞 Anchor: Five models call a tricky reading question 'easy,' but student data says it's 'hard.'
- 🍞 Hook: Acting in a school play as a 'beginner' doesn't make you actually forget math. 🥬 Proficiency Simulation: What it is: Prompting a model to act like a low/medium/high-proficiency student to rate difficulty. How: 1) Give a role ('weak student'). 2) Have it rate or answer. 3) Optionally average across roles (ensemble). Why: If this worked, models could mimic students to predict difficulty. But it's inconsistent. 🍞 Anchor: The paper's personas changed distributions a bit but didn't reliably fix misalignment.
- 🍞 Hook: A puzzle champ can't tell which riddles confuse beginners. 🥬 Capability-Perception Gap: What it is: The mismatch between what a model can solve and what it thinks is hard for humans. How: 1) Measure model accuracy (actor view). 2) Measure its difficulty ratings (observer view). 3) Compare both to human truth. Why: If the gap is big, predictions won't guide real learners well. 🍞 Anchor: A model aces SAT Math but can't sense why an average student struggles on multi-step algebra.
- 🍞 Hook: A math teacher forgets how tricky fractions felt at first. 🥬 Curse of Knowledge: What it is: Strong models can't 'unsee' answers, so they fail to simulate struggling students. How: 1) Prompt to 'act weak.' 2) Accuracy barely drops. 3) Many 'hard-for-humans' items are 'easy-for-models' (high Savant rates). Why: Without overcoming this, difficulty estimates stay optimistic. 🍞 Anchor: On USMLE, over two-thirds of the human-hardest third of items were solved by at least 90% of models.
- 🍞 Hook: Before a test, you guess which problems you'll miss. 🥬 Metacognitive Alignment: What it is: Do a model's difficulty scores predict when it will be wrong? How: 1) Mark which items the model missed. 2) See if it gave those higher difficulty (AUROC). Why: Without self-awareness, its 'hard' label won't warn us reliably. 🍞 Anchor: Most models score near 0.55 AUROC, barely better than chance.
03 Methodology
At a high level: Input (question text + correct answer for perception, or no answer for solving) → Model either rates difficulty (observer) or answers the question (actor) → We compare to human ground truth using statistics.
Step A: Observer View (Perceived Difficulty)
- What happens: The model sees the full item (including the correct answer) and outputs a difficulty rating. This isolates 'judging how hard' from 'trying to solve it.'
- Why this step exists: We want the model's sense of human struggle, not whether it can solve it.
- Example: For a Cambridge reading item, the model outputs: 'Difficulty = 62/100.' We compare to the real IRT-based difficulty.
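A sketch of how the observer query might be issued; the prompt wording and the `chat()` helper are hypothetical placeholders for whichever model API is being evaluated, not the paper's actual prompts.

```python
import re

def chat(prompt: str) -> str:
    """Hypothetical stub: wire this to whichever LLM API you are evaluating."""
    raise NotImplementedError

def rate_difficulty(question: str, options: list[str], answer_key: str) -> float:
    """Observer view: the model sees the full item, including the correct answer,
    and returns only a difficulty judgment; it is never asked to solve anything."""
    option_block = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", options))
    prompt = (
        "Estimate how difficult this exam item is for real test-takers.\n"
        f"Question: {question}\n{option_block}\n"
        f"Correct answer: {answer_key}\n"
        "Rate the difficulty for an average student on a 0-100 scale "
        "and reply with a single number."
    )
    match = re.search(r"\d+(\.\d+)?", chat(prompt))
    return float(match.group()) if match else float("nan")
```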
Step B: Actor View (Intrinsic Capability)
- What happens: The model takes the question without the answer, tries to solve it, and we mark correct/incorrect.
- Why this step exists: To see what actually feels hard to the model itself, so we can compare model-experienced difficulty to human difficulty.
- Example: On an SAT Math problem, a model gets it right. We store a 1; if wrong, a 0.
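A small sketch of the grading step, assuming each model's recorded answer choices have already been collected (via a solve prompt like the one in Step A, minus the answer key); the letters below are invented for illustration.

```python
import numpy as np

# Hypothetical recorded choices: one row per model, one column per item.
answers = [["C", "A", "B", "D"],
           ["C", "B", "B", "A"],
           ["A", "A", "B", "D"]]
answer_key = ["C", "A", "B", "D"]

# Actor-view outcomes: 1 = correct, 0 = incorrect.
correctness = np.array([[int(choice == key) for choice, key in zip(row, answer_key)]
                        for row in answers])
print(correctness)           # the matrix fed into the IRT step below
print(correctness.mean(1))   # each model's raw accuracy
```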
Step C: IRT Across Models (Machine Difficulty)
- What happens: Treat each model like a 'student' and fit an IRT Rasch model to estimate each item's 'machine difficulty' from the correctness matrix.
- Why this step exists: It tells us, independent of one model's opinion, which items are easy or hard for the model population.
- Example: A USMLE item that almost all models get right has a low machine difficulty.
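A minimal joint-maximum-likelihood sketch of the Rasch fit over a model-by-item correctness matrix; production analyses would typically use a dedicated IRT package, so treat this as illustrative only.

```python
import numpy as np

def fit_rasch(X, n_iter=2000, lr=0.01):
    """Joint maximum-likelihood fit of a Rasch (1PL) model.
    X: binary matrix with rows = models (treated as 'students') and cols = items."""
    n_models, n_items = X.shape
    theta = np.zeros(n_models)  # ability of each model
    b = np.zeros(n_items)       # 'machine difficulty' of each item
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # P(correct)
        resid = X - p
        theta += lr * resid.sum(axis=1)  # gradient ascent on the log-likelihood
        b -= lr * resid.sum(axis=0)      # d(logL)/db has the opposite sign
        theta -= theta.mean()            # pin the scale so the fit is identified
    return theta, b

# Toy correctness matrix: 4 models x 5 items (1 = solved). An item solved by
# every model (a column of ones) is pushed toward very low difficulty -- the
# saturation effect discussed in the results.
X = np.array([[1, 1, 1, 0, 1],
              [1, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [1, 1, 1, 1, 1]])
abilities, machine_difficulty = fit_rasch(X)
print(machine_difficulty)  # higher values = harder for the model population
```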
Step D: Proficiency Simulation (Personas)
- What happens: Use prompts to make models 'act' like low/medium/high-proficiency students when rating difficulty or answering.
- Why this step exists: If models can authentically role-play, their ratings might align better with real students.
- Example: The 'low-proficiency' persona might rate a reading item as 'hard' due to vocabulary and inference steps.
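One way such persona prompts might be layered on top of the observer call from Step A; the persona wording here is invented for illustration and is not quoted from the paper.

```python
# Hypothetical persona preambles (illustrative, not the paper's exact wording).
PERSONAS = {
    "low":    "Answer as a struggling student who finds hard vocabulary and "
              "multi-step reasoning confusing.",
    "medium": "Answer as an average student with typical preparation for this exam.",
    "high":   "Answer as a top student who rarely misses questions in this subject.",
}

def rate_with_persona(question: str, options: list[str], answer_key: str, level: str) -> float:
    """Same observer call as in Step A, but prefixed with a proficiency role."""
    persona_prompt = f"{PERSONAS[level]}\n\n{question}"
    return rate_difficulty(persona_prompt, options, answer_key)
```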
Step E: Ensembling Predictions
- What happens: Average difficulty ratings across top models, or average the same modelâs ratings across different personas.
- Why this step exists: To denoise random variation and see if groups do better than individuals.
- Example: Average GPT-4.1, QWQ-32B, and DeepSeek-R1 predictions to see if correlation rises.
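A small sketch of the two averaging strategies; the model names match the example above, but the ratings themselves are made up.

```python
import numpy as np

# Per-item difficulty ratings (0-100) from three models; values are invented.
ratings_by_model = {
    "gpt-4.1":     np.array([62, 35, 71, 48]),
    "qwq-32b":     np.array([55, 40, 80, 52]),
    "deepseek-r1": np.array([60, 30, 75, 45]),
}
# Strategy 1: average the same items across different models.
model_ensemble = np.mean(list(ratings_by_model.values()), axis=0)

# Strategy 2: average one model's ratings across low/medium/high personas.
ratings_by_persona = np.array([[70, 45, 85, 60],   # low-proficiency persona
                               [58, 38, 72, 50],   # medium
                               [40, 25, 55, 35]])  # high
persona_ensemble = ratings_by_persona.mean(axis=0)
print(model_ensemble, persona_ensemble)
```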
Secret Sauce (why the approach is clever):
- It separates perception (how hard it seems) from capability (what models can do), then adds a self-check (metacognition). This three-way view makes misalignment visible and measurable.
Mini-Sandwiches for the key tools:
- 🍞 Hook: You and your friend each line up books from easiest to hardest. Do your orders match? 🥬 Spearman Correlation: What it is: A score that tells how similarly two rankings order items. How: 1) Rank both lists. 2) Compare positions. 3) High score means 'we agree on the order.' Why: It focuses on 'which is harder than which' even if scales differ. 🍞 Anchor: If both you and the AI say Item A is harder than Item B most of the time, Spearman is high.
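A quick check of ranking agreement, assuming per-item human difficulties and one model's ratings are already in hand; `scipy.stats.spearmanr` handles the ranking internally, and the numbers here are made up.

```python
from scipy.stats import spearmanr

# Invented numbers: ground-truth (IRT-based) difficulty vs. one model's ratings.
human_difficulty = [0.8, 0.2, 0.6, 0.4, 0.9]
model_ratings    = [70,  30,  40,  50,  65]   # 0-100 scale; only the order matters

rho, p_value = spearmanr(human_difficulty, model_ratings)
print(f"Spearman rho = {rho:.2f}")  # 1.0 = same ordering, 0 = unrelated orderings
```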
- 🍞 Hook: A coach estimates how likely each player scores on a shot based on skill. 🥬 Rasch Model (IRT 1PL): What it is: A simple IRT model that uses one item parameter (difficulty) and one person parameter (ability). How: 1) Assume each model has an ability number. 2) Each item has a difficulty number. 3) A logistic curve maps their difference to success odds. 4) Fit numbers from the right/wrong data. Why: It cleanly separates 'who you are' from 'how hard this is.' 🍞 Anchor: A high-ability model has a high chance to solve an average item; a low-ability model struggles on high-difficulty items.
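For reference, that logistic curve written out directly; the ability and difficulty values below are arbitrary illustrations.

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL): probability of a correct response for one ability-difficulty pair."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

print(p_correct(1.0, -1.0))   # strong solver, easy item  -> about 0.88
print(p_correct(-1.0, 1.0))   # weak solver, hard item    -> about 0.12
```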
- 🍞 Hook: Before a quiz, you predict whether you'll miss a question. 🥬 AUROC (for metacognition): What it is: A number showing how well 'difficulty scores' separate the items you'll miss from those you'll get right. How: 1) Label missed vs. solved. 2) See how often 'missed' items got higher difficulty scores. 3) 0.5 = guesswork; nearer 1 = strong self-awareness. Why: If the model can't flag its own likely failures, its difficulty sense isn't grounded. 🍞 Anchor: AUROC ≈ 0.55 means 'barely better than guessing' which items it will miss.
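A sketch of the metacognition check with toy numbers: label each item by whether the model missed it, then ask whether the model's own difficulty ratings rank the missed items higher.

```python
from sklearn.metrics import roc_auc_score

# Invented example: 1 = the model missed the item, 0 = it answered correctly.
missed = [1, 0, 0, 1, 0, 1, 0, 0]
# The same model's own 0-100 difficulty ratings for those items.
self_rated_difficulty = [80, 30, 55, 60, 40, 45, 70, 20]

auroc = roc_auc_score(missed, self_rated_difficulty)
print(f"AUROC = {auroc:.2f}")  # 0.5 = chance; near 1 = it knows what it will miss
```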
Concrete example flow (Cambridge reading item):
- Observer: Model reads passage+options+correct answer, rates difficulty 58/100.
- Actor: Another run without the answer; model picks C (correct).
- Machine IRT: Across 21 models, 90% get it right → machine difficulty is low.
- Compare: Humans rated it mid-hard; the model's perception says mid, but machine difficulty says easy. That mismatch flags misalignment.
What breaks without each step:
- Without Observer: We'd only know what models can do, not what they think humans feel.
- Without Actor/IRT: We'd miss the curse of knowledge (human-hard but model-easy).
- Without Personas/Ensembles: We wouldn't test or reduce variance in 'acting like students.'
- Without AUROC: We couldn't judge self-awareness, so we'd trust shaky difficulty labels.
04 Experiments & Results
The tests and why: The authors measured whether models' difficulty rankings match human reality (Spearman correlation), whether items are easy/hard for models as a group (IRT), and whether models know when they'll be wrong (AUROC).
The competition: 20+ models, including frontier proprietary (e.g., GPT-4.1, GPT-5) and strong open models (e.g., QWQ-32B, DeepSeek-R1), across four domains: USMLE (medical), Cambridge (reading proficiency), SAT Reading/Writing, SAT Math.
Scoreboard with context:
- Perception alignment is weak overall. Average Spearman correlations are often below 0.5. For USMLE it's as low as ≈ 0.13 (that's like guessing classmates' heights and getting the order mostly wrong). SAT Math does better (≈ 0.41 on average) but still far from perfect.
- Bigger isn't reliably better. Some frontier models don't beat strong baselines; models tend to agree with each other more than with humans (machine consensus).
- Ensembling helps but hits a ceiling. Greedy ensembles raised SAT Math correlation from ≈ 0.56 to ≈ 0.66 before weaker models dragged it down, like adding too much water to good paint (a greedy-selection sketch follows this list).
- Personas (proficiency simulation) are inconsistent. Single-persona prompts can help or hurt, but averaging across personas often improves stability. For instance, GPT-5's average correlation rose from ≈ 0.34 to ≈ 0.47 when averaging personas: better, but still far from ideal.
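A sketch of greedy forward selection, one natural way to build the ensembles described in the ensembling bullet above; the paper's exact selection procedure may differ in its details.

```python
import numpy as np
from scipy.stats import spearmanr

def greedy_ensemble(ratings: dict, human: np.ndarray) -> list:
    """Greedy forward selection: repeatedly add the model whose inclusion most
    improves the Spearman correlation between the ensemble's mean difficulty
    rating and human difficulty; stop once no remaining model helps."""
    chosen, best_rho = [], -1.0
    remaining = dict(ratings)
    while remaining:
        best_name, best_new_rho = None, best_rho
        for name, r in remaining.items():
            candidate = np.mean([ratings[m] for m in chosen] + [r], axis=0)
            rho, _ = spearmanr(candidate, human)
            if rho > best_new_rho:
                best_name, best_new_rho = name, rho
        if best_name is None:
            break  # the ceiling: every remaining model would lower correlation
        chosen.append(best_name)
        best_rho = best_new_rho
        remaining.pop(best_name)
    return chosen
```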
The surprising 'curse of knowledge' (IRT findings):
- Machine IRT difficulty correlates even worse with human difficulty than the models' own perceptions. That means the items that feel hard to humans are often easy for models at scale.
- Saturation: A huge fraction of items are solved by ≥90% of models, flattening machine difficulty. USMLE saturation ≈ 75.6% (very high); SAT Math ≈ 54.6%.
- Savant items (human-top-33%-hard but solved by ≥90% of models): USMLE ≈ 70.4% (wow!), SAT Math ≈ 32.2%, Cambridge ≈ 22.1%, SAT Reading ≈ 25.5%. Translation: what's truly tough for people often looks trivial to machines (a small computation sketch follows this list).
- Persona accuracy shifts are tiny (usually <1%). Even when asked to 'be weak,' models mostly keep acing many items: proof of the curse of knowledge.
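A sketch of how the saturation and savant-item rates above could be computed, assuming per-item model solve rates and human difficulty values are available; the 90% and top-third thresholds follow the definitions quoted in this list.

```python
import numpy as np

def saturation_and_savant(solve_rate: np.ndarray, human_difficulty: np.ndarray):
    """solve_rate: fraction of models answering each item correctly.
    human_difficulty: ground-truth difficulty per item (higher = harder)."""
    saturated = solve_rate >= 0.90                      # trivial for the model population
    hard_cutoff = np.quantile(human_difficulty, 2 / 3)  # top third hardest for humans
    human_hard = human_difficulty >= hard_cutoff
    saturation_rate = saturated.mean()
    savant_rate = (saturated & human_hard).sum() / max(human_hard.sum(), 1)
    return saturation_rate, savant_rate
```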
Metacognitive blindness (AUROC):
- Most models hover around 0.50-0.60 AUROC (≈ 0.55 on average), barely better than guessing whether they'll miss a problem.
- A few bright spots (e.g., GPT-5 hitting ≈ 0.73 on Cambridge) show it's possible in pockets, but overall self-awareness remains poor.
Bottom line: Across domains, the same pattern repeats: models do not naturally feel human difficulty. They cluster around a machine view, can't convincingly pretend to be weaker, and don't accurately foresee their own mistakes.
05 Discussion & Limitations
Limitations:
- The study uses zero-shot prompting for personas and does not fine-tune on real student error traces. Perhaps few-shot or fine-tuned approaches could align better with human difficulty.
- Only four domains were tested (though diverse); other subjects or formats (open-ended responses, diagrams) might change results.
- The Rasch (1PL) model assumes equal discrimination across items; adding more parameters (2PL/3PL) might refine machine difficulty estimates.
Required resources:
- Access to multiple LLMs (open and proprietary), compute to run large batches, and datasets with ground-truth human difficulty from field tests (rare but crucial).
When not to use this approach alone:
- High-stakes test calibration without human data. The misalignment is too large to rely solely on LLMs.
- Personalized learning paths for novices where over-challenging items can harm motivation.
- Domains where model super-knowledge (e.g., medical fact retrieval) likely masks true human difficulty.
Open questions:
- Can fine-tuning with real student traces (errors, partial work) enable authentic proficiency simulation?
- Can models be trained to 'forget' or mask knowledge on command to overcome the curse of knowledge?
- Can we build explicit cognitive models (e.g., working memory limits, common misconceptions) into LLMs to better mirror human struggle?
- What uncertainty signals (beyond verbal confidence) best predict human-perceived difficulty?
- How do multimodal items (graphs, audio) affect alignment and metacognition?
06 Conclusion & Future Work
Three-sentence summary: This paper shows that today's powerful language models can solve many questions but still don't feel which ones are hard for humans. They agree with each other more than with real students, can't convincingly act as weaker learners, and lack self-awareness about their own likely mistakes. So, solving isn't sensing, and automated difficulty prediction needs more than just bigger brains.
Main achievement: A clean, large-scale, three-lens evaluation (perception, capability via IRT, and metacognition via AUROC) that reveals systematic Human-AI difficulty misalignment and the curse of knowledge.
Future directions: Fine-tune with real student traces and misconceptions; design 'knowledge masking' or curriculum-aware training to enable authentic low-proficiency simulation; integrate cognitive constraints into model objectives; develop stronger uncertainty signals tied to failure prediction; and use richer IRT models to capture machine difficulty nuances.
Why remember this: Because education runs on "what's hard for learners," and this paper proves that raw AI strength doesn't equal human-aligned difficulty sense. Until we bridge that empathy-like gap, AI-made tests, tutors, and study plans may be impressively fast, but not truly student-friendly.
Practical Applications
- Use ensemble-of-personas (low/mid/high) averages as a quick, low-cost stabilizer for LLM difficulty ratings, while recognizing limits.
- Triaging item banks: flag items that are human-hard but model-easy (savant items) for expert review before deploying to novices.
- Hybrid calibration: blend small pilot student data with LLM predictions, weighting human data more where misalignment is known (e.g., USMLE-like domains).
- Metacognition monitoring: track AUROC over time to decide when to rely on or override model difficulty advice.
- Domain-aware prompting: add explicit cognitive-load cues (multi-step reasoning, distractor strength) to nudge models to notice human struggle factors.
- Guardrails in adaptive systems: cap how fast difficulty can rise when guided by model predictions to prevent over-challenging learners.
- Data collection strategy: prioritize field-testing on items that models claim are 'easy' but humans might find 'hard' based on content features.
- Teacher-in-the-loop: present AI difficulty plus a short rationale and let educators quickly approve/adjust levels at scale.
- Future fine-tuning: train models on real student traces (errors, partial solutions) to learn authentic struggle patterns.
- Benchmark expansion: add multimodal items (graphs, diagrams) to reveal new misalignment patterns and improve calibration.