Humans and LLMs Diverge on Probabilistic Inferences
Key Summary
- Humans often make guesses about the world that are likely but not certain, and this paper studies how humans and AI compare at doing that.
- The authors built PROBCOPA, a set of 210 short stories with likely effects, and asked 25–30 people per story to rate how likely each effect was from 0 to 100.
- Human answers formed smooth, graded patterns: not just "yes" or "no," but lots of in-between, and usually with one main peak per story.
- Eight advanced reasoning AI models rarely picked "middle" likelihoods and were much more confident than people, especially on uncertain cases.
- Models matched people better on very obvious "likely" or "unlikely" cases, but they diverged most on the messy, in-between ones.
- Across many samples, models showed far less variation than humans, even when the temperature or "thinking time" was increased.
- When models wrote out their thinking, they often compared multiple possible scenarios before answering, a common pattern across systems.
- Combining (ensembling) model answers made them closer to human patterns, but still not at the human-to-human similarity level.
- This work shows that testing AI only on strict right-or-wrong problems misses how humans actually reason with uncertainty in everyday life.
Why This Research Matters
Many real-life decisions depend on careful "how likely is this?" thinking, not just yes-or-no answers. If AI systems skip the middle and act overconfident, they can mislead users in exactly the moments that need nuance. This paper shows how to measure and compare full distributions of human and model judgments, revealing where models fall short. With better evaluation, we can design training methods that encourage calibrated, middle-range use of the scale. That means safer advice in healthcare, more trustworthy forecasts in transportation and weather, and better teaching tools that model realistic uncertainty. In short, it pushes AI to be humbler, clearer, and more human-aware when the truth isn't black and white.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how when you hear, "There was an accident on the highway," you might think, "Traffic is probably worse," but you also know it's not guaranteed? That's everyday guessing with uncertainty.
The Concept (Probabilistic reasoning in daily life): It's making careful guesses about what's likely given limited information. How it works: 1) Start with a clue (the premise). 2) Use your background knowledge. 3) Consider different possibilities. 4) Pick how likely each one seems. Why it matters: Without this, we'd act like robots that only see black or white, missing the gray areas where real life happens.
Anchor: Hearing "It's cloudy and windy," you might say, "Rain is somewhat likely," not "It will 100% rain," because you're judging probabilities.
The world before this paper: AI systems have gotten very good at tasks with clear right-or-wrong answers, like math problems or code tests. But our daily reasoning isn't like that. We often reason under uncertainty: there isn't a single correct solution, only options that are more or less likely. In natural language research, a common way to test understanding is Natural Language Inference (NLI): given a sentence (premise), decide if another sentence (hypothesis) must be true, must be false, or could be either. Prior datasets captured some uncertainty but often pushed models and annotators toward three buckets: very likely, very unlikely, or neutral. That misses the smooth, graded way people actually judge likelihood.
The problem: We didn't know how modern "reasoning" language models behave when they must give a probability-like judgment rather than a definite fact. Do they show the same graded, varied patterns humans do? Or do they overcommit?
Failed attempts: 1) Existing NLI datasets often showed tri-modal human ratings, with few responses in the middle, so they weren't well suited to measuring graded human uncertainty. 2) Re-annotating big datasets with just 2–3 people per item produced unreliable averages: if one rater said "very likely" and another said "very unlikely," the mean looked "medium," even though nobody actually believed "medium." 3) Using model log-probabilities or internal scores isn't possible for many top reasoning models and might not reflect the final answer after they "think" in chains.
The gap: We needed a carefully built dataset where (a) each item invites a truly probabilistic judgment, (b) many humans (25–30) rate each item on a 0–100 scale, and (c) models give the same kind of 0–100 rating so we can directly compare distributions (not just single numbers). We also needed analysis tools to compare not only averages but the whole shape of the distributions.
Real stakes: This matters for any situation where AI advises people under uncertainty: health tips, traffic predictions, educational feedback, or safety guidance. If a model avoids "maybe" and acts overconfident, it can mislead users exactly when careful, humble estimates are needed. Trust and safety depend on more than being right on obvious cases; they depend on handling the messy middle thoughtfully.
02 Core Idea
Hook: Imagine a weather app that always says "sunny" or "storm" and never says "partly cloudy." You'd stop trusting it on tricky days.
The Concept (Key insight): The paper's big idea is that humans and today's reasoning AIs diverge most when judging medium-likelihood, uncertain inferences: humans give graded, varied answers, while AIs tend to skip the middle and act overconfident. How it works: 1) Build a dataset (PROBCOPA) of short premises and plausible effects. 2) Collect many human 0–100 likelihood ratings per item. 3) Ask multiple top reasoning models to give 0–100 ratings, sampled many times. 4) Compare entire distributions, not just averages, using spread and distance measures. Why it matters: If AI can't reflect human-like uncertainty, it may give misleading advice where nuance is crucial.
Anchor: When told, "There was an accident on the highway," people often say "traffic is likely worse," maybe 70 out of 100. Models, however, often jump to near 90+ or down near 10, and rarely pick a middle value like 60.
Multiple analogies:
- Thermostat analogy: A good thermostat doesn't just say "hot" or "cold"; it shows degrees. Humans give "temperature-like" degrees of likelihood. Many models flip to extremes, ignoring the middle.
- Teacher's grading analogy: A thoughtful teacher uses the full grade scale (A to F). Humans use the full 0–100 scale. Some models act as if only A or F exist, skipping B–D.
- Voting analogy: In a town vote, people rarely all choose the same number; you see a spread. Humans' answers form smooth hills; model answers cluster at the edges, leaving the middle empty.
Before vs. after: Before, we mostly tested reasoning models on crisp, right-or-wrong tasks. After, we see that in probabilistic, everyday language cases, models don't mirror human uncertainty well, especially on in-between judgments.
Why it works (intuition, not equations):
- To capture how "spread out" people's opinions are, the authors use a measure of randomness (differential entropy). High entropy means lots of uncertainty and disagreement; low entropy means most people agree.
- To compare two whole distributions (humans vs. models), they use a "how much work to rearrange one into the other" idea (Wasserstein distance). If the model's curve is far from the human curve, that distance is big.
- Together, these show that models align with humans on obvious items (very likely or very unlikely) but drift away in the fuzzy middle.
Building blocks:
- A new dataset (PROBCOPA) split from COPA effects-only items, giving 210 premise-effect pairs.
- 25–30 human ratings per pair on a 0–100 scale with a guidance legend.
- Eight state-of-the-art reasoning models, each sampled 30 times per item via a verbalized number.
- Distribution-level comparisons (entropy, Wasserstein distance), plus analysis of the models' written "reasoning chains."
Anchor: Think of pouring two piles of sand (human opinions vs. model opinions) and measuring how much effort it takes to reshape one pile into the other. On obvious items, the piles look similar. On tricky items, the piles look very different.
03 Methodology
At a high level: Premise and possible effect (input) → Humans and models each give 0–100 likelihood scores (Steps A and B) → Analyze spread and similarity of whole distributions (Step C) → Inspect model "reasoning chains" and behavior tweaks (Steps D and E) → Conclusions (output).
Hook (PROBCOPA): Imagine a science fair where each project asks, "How likely is X to happen after Y?" and lots of visitors place stickers from 0 to 100. You don't just see a winner; you see a landscape of opinions.
The Concept (What PROBCOPA is): PROBCOPA is a curated set of 210 short, English premise-effect pairs designed to invite graded, probabilistic judgments. How it works: 1) Start from the COPA dataset of plausible causes/effects. 2) Keep only effect items (to reduce causal estimation complexities). 3) Split each two-choice COPA problem into two separate premise-effect items, each judged on its own. 4) Recruit 25–30 native English speakers per item to rate likelihood from 0 to 100 using a simple guide. Why it matters: This produces rich, reproducible human distributions that reflect real uncertainty instead of forcing "this or that."
Anchor: Premise: "A drought occurred in the region." Effect: "The crops perished." Many people might rate this around, say, 80–90 (likely), but not 100 (not guaranteed), creating a smooth hill of ratings.
Step A: Human annotations
- What happens: 328 participants, each scoring up to 30 items, used a slider from 0 (definitely not) to 100 (definitely). There were practice examples, guidance on ranges, and attention checks. Two validation rounds showed highly reproducible results.
- Why this step exists: We need a trustworthy gold picture of how humans actually spread their judgments across the scale.
- Example: For "The team rigged the contest in their favor → They won," many people will pick high values, but not all will go to 100, making a mostly single-peaked distribution.
Hook (Reasoning chains): Imagine showing your work on a math problem. Even if the final answer is a single number, the steps you took matter.
The Concept (Reasoning chains): These are the model's intermediate, written steps before the final number. How it works: 1) The model self-narrates possible scenarios. 2) It weighs them. 3) It outputs a single number in <answer> tags. Why it matters: The steps can reveal patterns in how models evaluate likelihoods (like considering alternatives). Without them, we miss clues about their decision-making.
Anchor: Premise: "The girl looked pale." Effect: "Her father read her a story." A model might reason: "Comforting is plausible, but alternatives include medicine or a doctor visit," before giving a mid-range number.
Step B: Model elicitation via verbalized numbers
- What happens: For each item, each model is prompted to think and then give a single 0–100 number. The same numeric guide humans used is provided. This is repeated 30 times per item (default temperature), yielding a model distribution per item.
- Why this step exists: Many top models don't share internal probabilities, so we ask for their explicit estimate, matching the human format.
- Example: "There was an accident on the highway → Traffic was worse than usual." Humans might cluster around 60–85. Some models output 90+ repeatedly, avoiding 60–70.
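The elicitation-and-parsing loop can be sketched as follows. This is a minimal illustration, not the paper's actual harness: the `completions` list stands in for repeated samples of one item, and `parse_score` simply extracts the verbalized number from the `<answer>` tag convention described above.

```python
import re
import statistics

def parse_score(completion: str):
    """Extract a verbalized 0-100 score from an <answer> tag.

    Returns None when no valid number is present (e.g. a rambly
    output that never commits to a final answer).
    """
    match = re.search(r"<answer>\s*(\d{1,3})\s*</answer>", completion)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 100 else None

# Hypothetical sampled completions for one premise-effect item.
completions = [
    "Accidents usually slow traffic. <answer>90</answer>",
    "Delays are typical but not guaranteed. <answer>85</answer>",
    "The road may have been cleared quickly, so",   # no valid final number
    "<answer>92</answer>",
]

# Keep only valid scores; repeating this 30 times yields the per-item distribution.
scores = [s for s in (parse_score(c) for c in completions) if s is not None]
item_median = statistics.median(scores)  # 90 for these samples
```

Invalid samples are dropped rather than imputed, which mirrors the observation later in the paper that high-temperature runs can produce unusable outputs with no final number.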
Hook (Differential entropy): Think of guessing where jellybeans are on a line. If everyone points to almost the same spot, there's little mystery; if guesses are spread out, there's lots of mystery.
The Concept (Differential entropy): It measures how spread out a continuous set of numbers is (here: 0–100 scores). How it works: 1) Look at the whole distribution of scores. 2) Calculate its overall "spread of information." 3) Higher values mean more dispersion; lower values mean tighter agreement. Why it matters: Not all disagreements are equal. Entropy captures the shape of uncertainty better than simple variance when distributions have odd shapes. Without it, we might misread patterns of disagreement.
Anchor: If human ratings for an item mostly sit tightly between 80 and 90, entropy is low. If ratings scatter from 20 to 90, entropy is higher.
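This contrast can be sketched with SciPy's nonparametric estimator; the rating samples below are hypothetical, and the paper's exact estimator is not specified here, so treat this as illustrative:

```python
import numpy as np
from scipy.stats import differential_entropy

rng = np.random.default_rng(0)
# Hypothetical ratings for two items, clipped to the 0-100 slider range.
tight = np.clip(rng.normal(85, 3, size=30), 0, 100)     # raters mostly agree near 85
spread = np.clip(rng.uniform(20, 90, size=30), 0, 100)  # raters scattered widely

h_tight = differential_entropy(tight)    # lower: little "mystery"
h_spread = differential_entropy(spread)  # higher: lots of disagreement
```

With 25–30 ratings per item, as in PROBCOPA, a sample-based estimator like this is the natural tool, since no parametric form is assumed for the human distribution.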
Hook (Wasserstein distance): Picture moving piles of sand to reshape one pile into another. The more sand you have to move, and the farther you move it, the bigger the distance.
The Concept (Wasserstein distance): It quantifies how different two full distributions are (human vs. model). How it works: 1) Treat each distribution like a pile of probability mass. 2) Compute the minimum work to transform one pile into the other. 3) Higher distance = less similar. Why it matters: We want to compare entire curves, not just averages. Without this, a similar average could hide very different shapes.
Anchor: If humans cluster around 70 and models cluster around 90, you must "move" a lot of mass rightward, so the distance is large.
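The sand-moving intuition maps directly onto SciPy's one-dimensional `wasserstein_distance`. A minimal sketch with hypothetical samples matching the anchor (humans near 70, a model near 90):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
human = np.clip(rng.normal(70, 10, size=30), 0, 100)  # graded, centered near 70
model = np.clip(rng.normal(90, 3, size=30), 0, 100)   # tight, overconfident near 90

# Minimum "work" to reshape one rating pile into the other, in scale points.
d = wasserstein_distance(human, model)  # roughly 20 for these samples
```

Because the distance is measured in the same 0–100 units as the ratings, a value around 20 reads directly as "the model's mass sits about 20 points away from where the humans put theirs."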
Step C: Analysis metrics and tests
- What happens: For each item, compute (a) human and model medians; (b) entropies of each; (c) Wasserstein distance between human and model distributions. Also test unimodality for human data.
- Why this step exists: It lets us see not just who is close "on average," but how well the whole shape matches.
- Example: Many items show human unimodality (a single main hill), but models' responses look bi-modal overall and skip the middle.
Step D: Reasoning-chain analysis
- What happens: Sample 100 chains across models and hand-inspect patterns. Also correlate chain length with human entropy and response time.
- Why this step exists: It reveals how models think and whether they "work harder" on harder (more uncertain) items.
- Example: Chains often explicitly consider alternative scenarios before picking a number; chain length correlates more with human disagreement than with human time-on-task.
Step E: Behavior variations and ensembling
- What happens: Try higher/lower temperature, increase the "thinking budget" or reasoning effort, and add persona prompts. Also ensemble (combine) multiple models' answers.
- Why this step exists: To see if tweaks make model distributions more human-like or more varied.
- Example: Higher temperature sometimes increases diversity but can produce unusable, rambly outputs; more "thinking" doesn't change medians much; personas don't reach human-like variation; ensembling helps but doesn't reach the human-human similarity baseline.
Secret sauce: The clever part is matching humans and models on exactly the same kind of answer (0–100), collecting many samples from both, and comparing full distributions with thoughtful metrics. This turns fuzzy "uncertainty" into measurable shapes you can line up side by side.
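A toy sketch of why the ensembling in Step E helps: the two models below are hypothetical stand-ins that avoid the middle in opposite directions, and pooling their samples widens the combined distribution toward the (also hypothetical) human one.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
human = np.clip(rng.normal(65, 15, size=30), 0, 100)    # graded human ratings

# Two hypothetical overconfident models that skip the middle in opposite ways.
model_a = np.clip(rng.normal(90, 4, size=30), 0, 100)
model_b = np.clip(rng.normal(35, 4, size=30), 0, 100)
ensemble = np.concatenate([model_a, model_b])  # pool the sampled answers

d_a = wasserstein_distance(human, model_a)
d_b = wasserstein_distance(human, model_b)
d_ens = wasserstein_distance(human, ensemble)  # smaller than either alone here
```

Note the pooled distribution is still bi-modal, not a smooth human-like hill, which matches the paper's finding that ensembling narrows the gap without reaching the human-human baseline.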
04 Experiments & Results
The test: The authors measured how likely humans and models thought each effect was, given a premise, on a 0–100 scale. They cared about two things: 1) how spread out the answers were (entropy), and 2) how similar the whole shape of answers was between humans and models (Wasserstein distance). They also checked human reproducibility with two validation rounds and inspected model reasoning chains.
The competition: Eight advanced reasoning models were tested (e.g., Gemini-3, GPT-5, Claude Sonnet-4.5, Qwen3, Kimi-K2, GLM-4.6, DeepSeek-R1, Grok-4.1 Fast). Each model was sampled 30 times per item to get a distribution of scores.
The scoreboard with context:
- Human patterns: Across the whole dataset, humans used the full scale and produced many in-between ratings, not just extremes. For individual items, human distributions were almost always unimodal (one main hill), showing graded but focused judgments rather than split camps.
- Model patterns: Models' overall responses were often bi-modal and avoided the middle of the scale. In plain terms, they jumped to strong conclusions and rarely said "somewhat likely/unlikely." GPT-5 was the least extreme of the set but still showed this avoidance.
- Alignment by difficulty: On very obvious items (clearly likely or clearly unlikely), models' medians and distributions were closer to humans' (smaller Wasserstein distances). On trickier, middle-range items, where humans disagreed more (higher entropy), model-human distances were much larger. In school terms, models earned an "A" on the easy questions and a "C–D" on the tricky, in-between ones, while humans remained smooth and nuanced across the board.
- Variation gap: For every item, human responses showed more spread (higher entropy) than model samples. Even when model temperature was raised, models often either didn't reach human-like variety or began producing long, low-quality rambles without valid final numbers.
- Reasoning effort: Increasing the "thinking budget" or reasoning effort did not lead to statistically significant shifts in model medians on tested items, contrasting with some prior reports that "more thinking" causes more overconfidence.
- Reasoning-chain insights: By hand-inspecting 100 chains, the authors found a common pattern: models frequently enumerate alternatives ("could be X, could be Y...") and then pick one. Also, models tended to produce longer chains on items where humans disagreed more (entropy correlation up to about 0.50), but chain length correlated weakly with human response time.
- Ensembling: Combining answers across models made distributions closer to human ones, but still not as close as a new group of humans was to the original human group (i.e., the human-human baseline remained best).
Surprising findings:
- Human unimodality: Even where people disagreed, their ratings per item usually formed a single main hill rather than two opposing camps. That suggests consistent interpretations with natural uncertainty, not clashing interpretations.
- No "medium consensus" items: There were no items where people tightly agreed on a pure middle value; middle-likelihood items instead showed more spread and longer response times, a signal that these are genuinely harder judgments.
- Persona prompting didn't fix diversity: Giving models different "personas" slightly changed behavior but failed to reach human-level variation or human-like distributions.
Bottom line: Models align with humans on the obvious stuff but are overconfident and under-diverse on the uncertain, middle-likelihood terrain where real-life judgment is most needed.
05 Discussion & Limitations
Limitations:
- Language scope: PROBCOPA is in English; results may differ across languages and cultures where background knowledge and conventions vary.
- Data source: Items come from COPA (an older dataset), although reframed. It's possible that some sentences exist in model pretraining data, which might subtly influence behavior, even if the probabilistic framing is new.
- Verbalized likelihoods: Asking models for 0–100 numbers is practical but raises faithfulness questions: do these numbers truly reflect internal beliefs, or are they shaped by prompting style?
- Black-box constraints: Many state-of-the-art reasoning models don't expose internal probabilities, limiting more direct comparisons.
Required resources:
- A crowdsourcing setup with clear instructions, attention checks, and fair compensation to collect reliable human distributions.
- Access to multiple top reasoning models (and batch APIs) to sample many outputs per item.
- Compute and statistical tools to estimate entropy and Wasserstein distances robustly.
When not to use this approach:
- If you need hard, verifiable labels (like a math answer key), this graded-likelihood setting is unnecessary.
- If your application cannot tolerate any ambiguity (e.g., certain safety-critical triggers), you might need deterministic rules in addition to probabilistic judgments.
- If you can't sample multiple model outputs, you'll miss distributional comparisons and may draw shallow conclusions from single answers.
Open questions:
- Can new training or alignment methods help models use the middle of the scale appropriately and reflect human-like uncertainty?
- Are there prompts or interfaces that elicit more faithful, stable 0–100 estimates without devolving into rambly outputs at higher temperatures?
- How do results change in other languages or domains (medicine, law, education), where background knowledge and stakes differ?
- Can we design better metrics than entropy and Wasserstein for nuanced, human-facing uncertainty comparisons?
- How do we measure and improve the faithfulness of reasoning chains? Do they truly reflect the model's internal decision process?
Takeaway: As AI moves into human-facing roles, "How sure are you?" matters as much as "Are you right?" This paper shows models still need work to reflect the richer, more cautious patterns of human uncertainty.
06 Conclusion & Future Work
Three-sentence summary: The authors built PROBCOPA, a dataset of 210 premise-effect items with 25–30 human 0–100 likelihood ratings each, creating a reliable picture of human probabilistic judgments. Comparing these against eight advanced reasoning models, they found that models match humans on obvious cases but avoid the middle and show far less variation, leading to large divergences on uncertain items. Analyzing model reasoning chains revealed a shared habit of considering alternatives, but tweaks like more "thinking time" or personas did not close the human-model gap.
Main achievement: Establishing a rigorous, distribution-level evaluation of probabilistic language inferences and revealing a consistent, meaningful divergence between human and model judgments, especially on the messy middle where uncertainty is real.
Future directions: Explore training or alignment methods that encourage calibrated, middle-range use of the scale; test multilingual and domain-specific versions of PROBCOPA; develop better elicitation methods for model likelihood estimates and stronger checks on chain-of-thought faithfulness; and refine metrics for human-model uncertainty similarity.
Why remember this: Real life is not a worksheet with one right answer; it's a landscape of likelihoods. This paper shows that while today's reasoning AIs can shine on certainties, they stumble in the gray zones that people navigate every day. Measuring full distributions, not just single answers, is key to building AI that reasons more like us: carefully, humbly, and helpfully.
Practical Applications
- Build training curricula that reward models for calibrated, middle-range likelihoods on uncertain items.
- Use PROBCOPA-style evaluation when deploying AI advisors in healthcare, education, or customer support to detect overconfidence.
- Combine (ensemble) multiple models to reduce extremes and move closer to human-like distributions.
- Design user interfaces that show distributions or confidence bands instead of a single point estimate.
- Adopt prompting templates that explicitly ask models to consider alternatives before giving a number.
- Monitor entropy and Wasserstein distance during model updates to ensure uncertainty handling doesn't regress.
- Create domain-specific PROBCOPA variants (e.g., medical symptoms → possible outcomes) to tune and test calibration.
- Flag high-divergence cases (large human-model Wasserstein distance) for human review in critical applications.
- Incorporate human-likeness penalties into RLHF/RLAIF pipelines to discourage edge-only responses.
- Educate end users with legends (like the 0–100 guide) so they understand what the numbers mean and why "middle" is valuable.
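The monitoring idea above can be sketched as a simple regression check between model versions. Everything here is hypothetical (the rating samples, the notion of "old" vs. "new" model); it only illustrates how the two metrics combine into a deployable guardrail.

```python
import numpy as np
from scipy.stats import differential_entropy, wasserstein_distance

def uncertainty_report(human_scores, model_scores):
    """Per-item metrics worth tracking across model updates (0-100 scale)."""
    return {
        "human_entropy": differential_entropy(human_scores),
        "model_entropy": differential_entropy(model_scores),
        "wasserstein": wasserstein_distance(human_scores, model_scores),
    }

rng = np.random.default_rng(3)
human = np.clip(rng.normal(70, 12, size=30), 0, 100)
old_model = np.clip(rng.normal(75, 8, size=30), 0, 100)  # pre-update samples
new_model = np.clip(rng.normal(95, 2, size=30), 0, 100)  # update drifted to extremes

before = uncertainty_report(human, old_model)
after = uncertainty_report(human, new_model)
# Flag the update: it collapsed its spread and moved away from the humans.
regressed = after["wasserstein"] > before["wasserstein"]
```

Run over every item in a PROBCOPA-style suite, a rise in median Wasserstein distance or a drop in model entropy would flag exactly the overconfidence pattern this paper documents.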