Humans and LLMs Diverge on Probabilistic Inferences
Key Summary
- Humans often make guesses about the world that are likely but not certain, and this paper studies how humans and AI compare at doing that.
- The authors built PROBCOPA, a set of 210 short stories with likely effects, and asked 25–30 people per story to rate how likely each effect was from 0 to 100.
- Human answers formed smooth, graded patterns: not just "yes" or "no," but lots of in-between, and usually with one main peak per story.
- Eight advanced reasoning AI models rarely picked "middle" likelihoods and were much more confident than people, especially on uncertain cases.
- Models matched people better on very obvious "likely" or "unlikely" cases, but they diverged most on the messy, in-between ones.
- Across many samples, models showed far less variation than humans, even when the temperature or "thinking time" was increased.
- When models wrote out their thinking, they often compared multiple possible scenarios before answering, a common pattern across systems.
- Combining (ensembling) model answers made them closer to human patterns, but still not at the human-to-human similarity level.
- This work shows that testing AI only on strict right-or-wrong problems misses how humans actually reason with uncertainty in everyday life.
Why This Research Matters
Many real-life decisions depend on careful "how likely is this?" thinking, not just yes-or-no answers. If AI systems skip the middle and act overconfident, they can mislead users in exactly the moments that need nuance. This paper shows how to measure and compare full distributions of human and model judgments, revealing where models fall short. With better evaluation, we can design training methods that encourage calibrated, middle-range use of the scale. That means safer advice in healthcare, more trustworthy forecasts in transportation and weather, and better teaching tools that model realistic uncertainty. In short, it pushes AI to be humbler, clearer, and more human-aware when the truth isn't black and white.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how when you hear, "There was an accident on the highway," you might think, "Traffic is probably worse," but you also know it's not guaranteed? That's everyday guessing with uncertainty.
The Concept (Probabilistic reasoning in daily life): It's making careful guesses about what's likely given limited information. How it works: 1) Start with a clue (the premise). 2) Use your background knowledge. 3) Consider different possibilities. 4) Pick how likely each one seems. Why it matters: Without this, we'd act like robots that only see black or white, missing the gray areas where real life happens.
Anchor: Hearing "It's cloudy and windy," you might say, "Rain is somewhat likely," not "It will 100% rain," because you're judging probabilities.
The world before this paper: AI systems have gotten very good at tasks with clear right-or-wrong answers, like math problems or code tests. But our daily reasoning isn't like that. We often reason under uncertainty: there isn't a single correct solution, only options that are more or less likely. In natural language research, a common way to test understanding is Natural Language Inference (NLI): given a sentence (premise), decide if another sentence (hypothesis) must be true, must be false, or could be either. Prior datasets captured some uncertainty but often pushed models and annotators toward three buckets: very likely, very unlikely, or neutral. That misses the smooth, graded way people actually judge likelihood.
The problem: We didn't know how modern "reasoning" language models behave when they must give a probability-like judgment rather than a definite fact. Do they show the same graded, varied patterns humans do? Or do they overcommit?
Failed attempts: 1) Existing NLI datasets often showed tri-modal human ratings, with few responses in the middle, so they weren't well suited to measuring graded human uncertainty. 2) Re-annotating big datasets with just 2–3 people per item produced unreliable averages: if one rater said "very likely" and another said "very unlikely," the mean looked "medium," even though nobody actually believed "medium." 3) Using model log-probabilities or internal scores isn't possible for many top reasoning models and might not reflect the final answer after they "think" in chains.
The gap: We needed a carefully built dataset where (a) each item invites a truly probabilistic judgment, (b) many humans (25–30) rate each item on a 0–100 scale, and (c) models give the same kind of 0–100 rating so we can directly compare distributions (not just single numbers). We also needed analysis tools to compare not only averages but the whole shape of the distributions.
Real stakes: This matters for any situation where AI advises people under uncertainty: health tips, traffic predictions, educational feedback, or safety guidance. If a model avoids "maybe" and acts overconfident, it can mislead users exactly when careful, humble estimates are needed. Trust and safety depend on more than being right on obvious cases; they depend on handling the messy middle thoughtfully.
02 Core Idea
Hook: Imagine a weather app that always says "sunny" or "storm" and never says "partly cloudy." You'd stop trusting it on tricky days.
The Concept (Key insight): The paper's big idea is that humans and today's reasoning AIs diverge most when judging medium-likelihood, uncertain inferences: humans give graded, varied answers, while AIs tend to skip the middle and act overconfident. How it works: 1) Build a dataset (PROBCOPA) of short premises and plausible effects. 2) Collect many human 0–100 likelihood ratings per item. 3) Ask multiple top reasoning models to give 0–100 ratings, sampled many times. 4) Compare entire distributions, not just averages, using spread and distance measures. Why it matters: If AI can't reflect human-like uncertainty, it may give misleading advice where nuance is crucial.
Anchor: When told, "There was an accident on the highway," people often say "traffic is likely worse," maybe 70 out of 100. Models, however, often jump to near 90+ or down near 10, and rarely pick a middle value like 60.
Multiple analogies:
- Thermostat analogy: A good thermostat doesn't just say "hot" or "cold"; it shows degrees. Humans give "temperature-like" degrees of likelihood. Many models flip to extremes, ignoring the middle.
- Teacher's grading analogy: A thoughtful teacher uses the full grade scale (A to F). Humans use the full 0–100 scale. Some models act as if only A or F exist, skipping B–D.
- Voting analogy: In a town vote, people rarely all choose the same number; you see a spread. Humans' answers form smooth hills; model answers cluster at the edges, leaving the middle empty.
Before vs. after: Before, we mostly tested reasoning models on crisp, right-or-wrong tasks. After, we see that in probabilistic, everyday language cases, models don't mirror human uncertainty well, especially on in-between judgments.
Why it works (intuition, not equations):
- To capture how "spread out" people's opinions are, the authors use a measure of randomness (differential entropy). High entropy means lots of uncertainty and disagreement; low entropy means most people agree.
- To compare two whole distributions (humans vs. models), they use a "how much work to rearrange one into the other" idea (Wasserstein distance). If the model's curve is far from the human curve, that distance is big.
- Together, these show that models align with humans on obvious items (very likely or very unlikely) but drift away in the fuzzy middle.
Building blocks:
- A new dataset (PROBCOPA) split from COPA effects-only items, giving 210 premise-effect pairs.
- 25–30 human ratings per pair on a 0–100 scale with a guidance legend.
- Eight state-of-the-art reasoning models, each sampled 30 times per item via a verbalized number.
- Distribution-level comparisons (entropy, Wasserstein distance), plus analysis of the models' written "reasoning chains."
Anchor: Think of pouring two piles of sand (human opinions vs. model opinions) and measuring how much effort it takes to reshape one pile into the other. On obvious items, the piles look similar. On tricky items, the piles look very different.
03 Methodology
At a high level: Premise and possible effect (input) → Humans and models each give 0–100 likelihood scores (Steps A and B) → Analyze spread and similarity of whole distributions (Step C) → Inspect model "reasoning chains" and behavior tweaks (Steps D and E) → Conclusions (output).
Hook (PROBCOPA): Imagine a science fair where each project asks, "How likely is X to happen after Y?" and lots of visitors place stickers from 0 to 100. You don't just see a winner; you see a landscape of opinions.
The Concept (What PROBCOPA is): PROBCOPA is a curated set of 210 short, English premise-effect pairs designed to invite graded, probabilistic judgments. How it works: 1) Start from the COPA dataset of plausible causes/effects. 2) Keep only effect items (to reduce causal estimation complexities). 3) Split each two-choice COPA problem into two separate premise-effect items, each judged on its own. 4) Recruit 25–30 native English speakers per item to rate likelihood from 0 to 100 using a simple guide. Why it matters: This produces rich, reproducible human distributions that reflect real uncertainty instead of forcing "this or that."
Anchor: Premise: "A drought occurred in the region." Effect: "The crops perished." Many people might rate this around, say, 80–90 (likely), but not 100 (not guaranteed), creating a smooth hill of ratings.
Step A: Human annotations
- What happens: 328 participants, each scoring up to 30 items, used a slider from 0 (definitely not) to 100 (definitely). There were practice examples, guidance on ranges, and attention checks. Two validation rounds showed highly reproducible results.
- Why this step exists: We need a trustworthy gold picture of how humans actually spread their judgments across the scale.
- Example: For "The team rigged the contest in their favor → They won," many people will pick high values, but not all will go to 100, making a mostly single-peaked distribution.
Hook (Reasoning chains): Imagine showing your work on a math problem. Even if the final answer is a single number, the steps you took matter.
The Concept (Reasoning chains): These are the model's intermediate, written steps before the final number. How it works: 1) The model self-narrates possible scenarios. 2) It weighs them. 3) It outputs a single number in <answer> tags. Why it matters: The steps can reveal patterns in how models evaluate likelihoods (like considering alternatives). Without them, we miss clues about their decision-making.
Anchor: Premise: "The girl looked pale." Effect: "Her father read her a story." A model might reason: "Comforting is plausible, but alternatives include medicine or a doctor visit," before giving a mid-range number.
Step B: Model elicitation via verbalized numbers
- What happens: For each item, each model is prompted to think and then give a single 0–100 number. The same numeric guide humans used is provided. This is repeated 30 times per item (default temperature), yielding a model distribution per item.
- Why this step exists: Many top models don't share internal probabilities, so we ask for their explicit estimate, matching the human format.
- Example: "There was an accident on the highway → Traffic was worse than usual." Humans might cluster around 60–85. Some models output 90+ repeatedly, avoiding 60–70.
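The elicitation-and-parsing loop can be sketched as follows. This is a minimal illustration, not the paper's actual harness: the `completions` list stands in for repeated samples of one item, and `parse_score` simply extracts the verbalized number from the `<answer>` tag convention described above.

```python
import re
import statistics

def parse_score(completion: str):
    """Extract a verbalized 0-100 score from an <answer> tag.

    Returns None when no valid number is present (e.g. a rambly
    output that never commits to a final answer).
    """
    match = re.search(r"<answer>\s*(\d{1,3})\s*</answer>", completion)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 100 else None

# Hypothetical sampled completions for one premise-effect item.
completions = [
    "Accidents usually slow traffic. <answer>90</answer>",
    "Delays are typical but not guaranteed. <answer>85</answer>",
    "The road may have been cleared quickly, so",   # no valid final number
    "<answer>92</answer>",
]

# Keep only valid scores; repeating this 30 times yields the per-item distribution.
scores = [s for s in (parse_score(c) for c in completions) if s is not None]
item_median = statistics.median(scores)  # 90 for these samples
```

Invalid samples are dropped rather than imputed, which mirrors the observation later in the paper that high-temperature runs can produce unusable outputs with no final number.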
Hook (Differential entropy): Think of guessing where jellybeans are on a line. If everyone points to almost the same spot, there's little mystery; if guesses are spread out, there's lots of mystery.
The Concept (Differential entropy): It measures how spread out a continuous set of numbers is (here: 0–100 scores). How it works: 1) Look at the whole distribution of scores. 2) Calculate its overall "spread of information." 3) Higher values mean more dispersion; lower values mean tighter agreement. Why it matters: Not all disagreements are equal. Entropy captures the shape of uncertainty better than simple variance when distributions have odd shapes. Without it, we might misread patterns of disagreement.
Anchor: If human ratings for an item mostly sit tightly between 80 and 90, entropy is low. If ratings scatter from 20 to 90, entropy is higher.
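This contrast can be sketched with SciPy's nonparametric estimator; the rating samples below are hypothetical, and the paper's exact estimator is not specified here, so treat this as illustrative:

```python
import numpy as np
from scipy.stats import differential_entropy

rng = np.random.default_rng(0)
# Hypothetical ratings for two items, clipped to the 0-100 slider range.
tight = np.clip(rng.normal(85, 3, size=30), 0, 100)     # raters mostly agree near 85
spread = np.clip(rng.uniform(20, 90, size=30), 0, 100)  # raters scattered widely

h_tight = differential_entropy(tight)    # lower: little "mystery"
h_spread = differential_entropy(spread)  # higher: lots of disagreement
```

With 25–30 ratings per item, as in PROBCOPA, a sample-based estimator like this is the natural tool, since no parametric form is assumed for the human distribution.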
Hook (Wasserstein distance): Picture moving piles of sand to reshape one pile into another. The more sand you have to move, and the farther you move it, the bigger the distance.
The Concept (Wasserstein distance): It quantifies how different two full distributions are (human vs. model). How it works: 1) Treat each distribution like a pile of probability mass. 2) Compute the minimum work to transform one pile into the other. 3) Higher distance = less similar. Why it matters: We want to compare entire curves, not just averages. Without this, a similar average could hide very different shapes.
Anchor: If humans cluster around 70 and models cluster around 90, you must "move" a lot of mass rightward, so the distance is large.
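The sand-moving intuition maps directly onto SciPy's one-dimensional `wasserstein_distance`. A minimal sketch with hypothetical samples matching the anchor (humans near 70, a model near 90):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
human = np.clip(rng.normal(70, 10, size=30), 0, 100)  # graded, centered near 70
model = np.clip(rng.normal(90, 3, size=30), 0, 100)   # tight, overconfident near 90

# Minimum "work" to reshape one rating pile into the other, in scale points.
d = wasserstein_distance(human, model)  # roughly 20 for these samples
```

Because the distance is measured in the same 0–100 units as the ratings, a value around 20 reads directly as "the model's mass sits about 20 points away from where the humans put theirs."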
Step C: Analysis metrics and tests
- What happens: For each item, compute (a) human and model medians; (b) entropies of each; (c) Wasserstein distance between human and model distributions. Also test unimodality for human data.
- Why this step exists: It lets us see not just who is close "on average," but how well the whole shape matches.
- Example: Many items show human unimodality (a single main hill), but models' responses look bi-modal overall and skip the middle.
Step D: Reasoning-chain analysis
- What happens: Sample 100 chains across models and hand-inspect patterns. Also correlate chain length with human entropy and response time.
- Why this step exists: It reveals how models think and whether they "work harder" on harder (more uncertain) items.
- Example: Chains often explicitly consider alternative scenarios before picking a number; chain length correlates more with human disagreement than with human time-on-task.
Step E: Behavior variations and ensembling
- What happens: Try higher/lower temperature, increase the "thinking budget" or reasoning effort, and add persona prompts. Also ensemble (combine) multiple models' answers.
- Why this step exists: To see if tweaks make model distributions more human-like or more varied.
- Example: Higher temperature sometimes increases diversity but can produce unusable, rambly outputs; more "thinking" doesn't change medians much; personas don't reach human-like variation; ensembling helps but doesn't reach the human-human similarity baseline.
Secret sauce: The clever part is matching humans and models on exactly the same kind of answer (0–100), collecting many samples from both, and comparing full distributions with thoughtful metrics. This turns fuzzy "uncertainty" into measurable shapes you can line up side by side.
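A toy sketch of why the ensembling in Step E helps: the two models below are hypothetical stand-ins that avoid the middle in opposite directions, and pooling their samples widens the combined distribution toward the (also hypothetical) human one.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
human = np.clip(rng.normal(65, 15, size=30), 0, 100)    # graded human ratings

# Two hypothetical overconfident models that skip the middle in opposite ways.
model_a = np.clip(rng.normal(90, 4, size=30), 0, 100)
model_b = np.clip(rng.normal(35, 4, size=30), 0, 100)
ensemble = np.concatenate([model_a, model_b])  # pool the sampled answers

d_a = wasserstein_distance(human, model_a)
d_b = wasserstein_distance(human, model_b)
d_ens = wasserstein_distance(human, ensemble)  # smaller than either alone here
```

Note the pooled distribution is still bi-modal, not a smooth human-like hill, which matches the paper's finding that ensembling narrows the gap without reaching the human-human baseline.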
04 Experiments & Results
The test: The authors measured how likely humans and models thought each effect was, given a premise, on a 0–100 scale. They cared about two things: 1) how spread out the answers were (entropy), and 2) how similar the whole shape of answers was between humans and models (Wasserstein distance). They also checked human reproducibility with two validation rounds and inspected model reasoning chains.
The competition: Eight advanced reasoning models were tested (e.g., Gemini-3, GPT-5, Claude Sonnet-4.5, Qwen3, Kimi-K2, GLM-4.6, DeepSeek-R1, Grok-4.1 Fast). Each model was sampled 30 times per item to get a distribution of scores.
The scoreboard with context:
- Human patterns: Across the whole dataset, humans used the full scale and produced many in-between ratings, not just extremes. For individual items, human distributions were almost always unimodal (one main hill), showing graded but focused judgments rather than split camps.
- Model patterns: Models' overall responses were often bi-modal and avoided the middle of the scale. In plain terms, they jumped to strong conclusions and rarely said "somewhat likely/unlikely." GPT-5 was the least extreme of the set but still showed this avoidance.
- Alignment by difficulty: On very obvious items (clearly likely or clearly unlikely), models' medians and distributions were closer to humans' (smaller Wasserstein distances). On trickier, middle-range items, where humans disagreed more (higher entropy), model-human distances were much larger. In school terms, models earned an "A" on the easy questions and a "C–D" on the tricky, in-between ones, while humans remained smooth and nuanced across the board.
- Variation gap: For every item, human responses showed more spread (higher entropy) than model samples. Even when model temperature was raised, models often either didn't reach human-like variety or began producing long, low-quality rambles without valid final numbers.
- Reasoning effort: Increasing the "thinking budget" or reasoning effort did not lead to statistically significant shifts in model medians on tested items, contrasting with some prior reports that "more thinking" causes more overconfidence.
- Reasoning-chain insights: By hand-inspecting 100 chains, the authors found a common pattern: models frequently enumerate alternatives ("could be X, could be Y...") and then pick one. Also, models tended to produce longer chains on items where humans disagreed more (entropy correlation up to about 0.50), but chain length correlated weakly with human response time.
- Ensembling: Combining answers across models made distributions closer to human ones, but still not as close as a new group of humans was to the original human group (i.e., the human-human baseline remained best).
Surprising findings:
- Human unimodality: Even where people disagreed, their ratings per item usually formed a single main hill rather than two opposing camps. That suggests consistent interpretations with natural uncertainty, not clashing interpretations.
- No "medium consensus" items: There were no items where people tightly agreed on a pure middle value; middle-likelihood items instead showed more spread and longer response times, a signal that these are genuinely harder judgments.
- Persona prompting didn't fix diversity: Giving models different "personas" slightly changed behavior but failed to reach human-level variation or human-like distributions.
Bottom line: Models align with humans on the obvious stuff but are overconfident and under-diverse on the uncertain, middle-likelihood terrain where real-life judgment is most needed.
05 Discussion & Limitations
Limitations:
- Language scope: PROBCOPA is in English; results may differ across languages and cultures where background knowledge and conventions vary.
- Data source: Items come from COPA (an older dataset), although reframed. It's possible that some sentences exist in model pretraining data, which might subtly influence behavior, even if the probabilistic framing is new.
- Verbalized likelihoods: Asking models for 0–100 numbers is practical but raises faithfulness questions: do these numbers truly reflect internal beliefs, or are they shaped by prompting style?
- Black-box constraints: Many state-of-the-art reasoning models don't expose internal probabilities, limiting more direct comparisons.
Required resources:
- A crowdsourcing setup with clear instructions, attention checks, and fair compensation to collect reliable human distributions.
- Access to multiple top reasoning models (and batch APIs) to sample many outputs per item.
- Compute and statistical tools to estimate entropy and Wasserstein distances robustly.
When not to use this approach:
- If you need hard, verifiable labels (like a math answer key), this graded-likelihood setting is unnecessary.
- If your application cannot tolerate any ambiguity (e.g., certain safety-critical triggers), you might need deterministic rules in addition to probabilistic judgments.
- If you can't sample multiple model outputs, you'll miss distributional comparisons and may draw shallow conclusions from single answers.
Open questions:
- Can new training or alignment methods help models use the middle of the scale appropriately and reflect human-like uncertainty?
- Are there prompts or interfaces that elicit more faithful, stable 0–100 estimates without devolving into rambly outputs at higher temperatures?
- How do results change in other languages or domains (medicine, law, education), where background knowledge and stakes differ?
- Can we design better metrics than entropy and Wasserstein for nuanced, human-facing uncertainty comparisons?
- How do we measure and improve the faithfulness of reasoning chains? Do they truly reflect the model's internal decision process?
Takeaway: As AI moves into human-facing roles, "How sure are you?" matters as much as "Are you right?" This paper shows models still need work to reflect the richer, more cautious patterns of human uncertainty.
06 Conclusion & Future Work
Three-sentence summary: The authors built PROBCOPA, a dataset of 210 premise-effect items with 25–30 human 0–100 likelihood ratings each, creating a reliable picture of human probabilistic judgments. Comparing these against eight advanced reasoning models, they found that models match humans on obvious cases but avoid the middle and show far less variation, leading to large divergences on uncertain items. Analyzing model reasoning chains revealed a shared habit of considering alternatives, but tweaks like more "thinking time" or personas did not close the human-model gap.
Main achievement: Establishing a rigorous, distribution-level evaluation of probabilistic language inferences and revealing a consistent, meaningful divergence between human and model judgments, especially on the messy middle where uncertainty is real.
Future directions: Explore training or alignment methods that encourage calibrated, middle-range use of the scale; test multilingual and domain-specific versions of PROBCOPA; develop better elicitation methods for model likelihood estimates and stronger checks on chain-of-thought faithfulness; and refine metrics for human-model uncertainty similarity.
Why remember this: Real life is not a worksheet with one right answer; it's a landscape of likelihoods. This paper shows that while today's reasoning AIs can shine on certainties, they stumble in the gray zones that people navigate every day. Measuring full distributions, not just single answers, is key to building AI that reasons more like us: carefully, humbly, and helpfully.
Practical Applications
- Build training curricula that reward models for calibrated, middle-range likelihoods on uncertain items.
- Use PROBCOPA-style evaluation when deploying AI advisors in healthcare, education, or customer support to detect overconfidence.
- Combine (ensemble) multiple models to reduce extremes and move closer to human-like distributions.
- Design user interfaces that show distributions or confidence bands instead of a single point estimate.
- Adopt prompting templates that explicitly ask models to consider alternatives before giving a number.
- Monitor entropy and Wasserstein distance during model updates to ensure uncertainty handling doesn't regress.
- Create domain-specific PROBCOPA variants (e.g., medical symptoms → possible outcomes) to tune and test calibration.
- Flag high-divergence cases (large human-model Wasserstein distance) for human review in critical applications.
- Incorporate human-likeness penalties into RLHF/RLAIF pipelines to discourage edge-only responses.
- Educate end users with legends (like the 0–100 guide) so they understand what the numbers mean and why "middle" is valuable.
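The monitoring idea above can be sketched as a simple regression check between model versions. Everything here is hypothetical (the rating samples, the notion of "old" vs. "new" model); it only illustrates how the two metrics combine into a deployable guardrail.

```python
import numpy as np
from scipy.stats import differential_entropy, wasserstein_distance

def uncertainty_report(human_scores, model_scores):
    """Per-item metrics worth tracking across model updates (0-100 scale)."""
    return {
        "human_entropy": differential_entropy(human_scores),
        "model_entropy": differential_entropy(model_scores),
        "wasserstein": wasserstein_distance(human_scores, model_scores),
    }

rng = np.random.default_rng(3)
human = np.clip(rng.normal(70, 12, size=30), 0, 100)
old_model = np.clip(rng.normal(75, 8, size=30), 0, 100)  # pre-update samples
new_model = np.clip(rng.normal(95, 2, size=30), 0, 100)  # update drifted to extremes

before = uncertainty_report(human, old_model)
after = uncertainty_report(human, new_model)
# Flag the update: it collapsed its spread and moved away from the humans.
regressed = after["wasserstein"] > before["wasserstein"]
```

Run over every item in a PROBCOPA-style suite, a rise in median Wasserstein distance or a drop in model entropy would flag exactly the overconfidence pattern this paper documents.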