LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
Key Summary
- This paper builds LIBERTy, a new way to fairly judge how well AI systems explain their decisions in terms of big, human ideas like age, race, or experience.
- Instead of asking people to hand-write counterfactual texts, LIBERTy uses a causal recipe (an SCM) where an LLM helps generate both the original text and its precise 'what-if' version.
- The framework creates three realistic datasets (disease detection, CV screening, and workplace violence prediction) with structural counterfactual text pairs.
- A new metric called order-faithfulness checks if an explanation at least ranks concepts in the right order, even if the scores use different scales.
- Across many models and methods, simple matching methods, especially those using embeddings from a fine-tuned model, were the most faithful.
- There is still big room for improvement: the best local methods score about 0.7 out of 1 on ordering and sit at an error of about 0.3, where 0 is perfect.
- Zero-shot LLMs like GPT-4o showed low sensitivity to demographic changes, likely because of safety and fairness alignment, while a fine-tuned small LLM (Qwen2.5-1.5B) tracked ground-truth effects better.
- LIBERTy is scalable, cheaper than human-written counterfactuals, and keeps the evaluation aligned with the actual data-generating process.
- This benchmark helps researchers build and compare more trustworthy, concept-based explanations for high-stakes AI decisions.
Why This Research Matters
When AI helps decide who gets hired, what illness someone might have, or how safe a workplace is, we must know which big ideas truly drove the decision. LIBERTy gives us a reliable way to check whether explanations about these ideas are faithful to cause-and-effect, not just good-looking guesses. It replaces costly human editing with a causal, scalable process that keeps the evaluation aligned with how the data were actually generated. This helps teams choose explanation methods that are trustworthy and practical for real decision-making. It also reveals where today's methods and models fall short, guiding future improvements. As more real-world data become LLM-generated, LIBERTy's approach becomes even more relevant.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how teachers want students to show their work, not just give answers? People want AIs to do that too, especially when the answers affect real lives, like hiring, healthcare, or safety.
The Concept: Concept-based explanations are about asking an AI, "How much did a big idea (like gender, experience, or a symptom) matter for your decision?"
- What it is: A way to measure the influence of human-understandable ideas on a model's prediction.
- How it works: (1) Pick a concept; (2) see how changing it would change the model's output; (3) report how big that change is.
- Why it matters: Without this, we only see token-level clues (like single words), not the big ideas people actually care about.
Anchor: If an AI reads a CV and says "Recommended," concept-based explanations tell you whether education or experience mattered more.
The World Before: AI could do many text tasks well, but explaining "why" in human terms was hard. A popular step forward used human-written counterfactuals: people edited a text to reflect a change (like making the author older) and we compared model predictions before vs. after. This powered the CEBaB benchmark, which was a big milestone but had limits.
The Problem: Human-written counterfactuals are costly, can be inconsistent, and don't come from the true cause-and-effect process that produced the original text. In other words, they're good approximations, but they are not the actual "what-if" from the same world with just one change.
Hook (new concept): Imagine a Lego city with rules for how roads connect and where houses can go. If you move one house, the traffic patterns change in predictable ways. You'd want your test to follow the city's rules.
The Concept (SCMs, Structural Causal Models): An SCM is a clear map of what causes what, plus simple equations that say how changes spread.
- What it is: A diagram with variables (like age, symptoms, or job quality) and arrows that show how one affects another, along with rules for generating data.
- How it works: (1) Define variables and arrows; (2) add noise terms to represent natural randomness; (3) use the rules to create examples.
- Why it matters: Without this map, we can't reliably create the exact "what-if" where only one thing changes.
Anchor: In a disease dataset, the SCM can say "Disease → symptoms," so changing the disease in a what-if should ripple into different symptoms.
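To make the "variables, arrows, noise, rules" recipe concrete, here is a minimal Python sketch of a toy SCM in the spirit of the disease example. The variable names, probabilities, and noise model are illustrative assumptions, not the benchmark's actual specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exogenous():
    """Draw the background noise once; it stays fixed when we later build counterfactuals."""
    return {"u_disease": rng.random(), "u_symptoms": rng.random(3)}

def structural_equations(disease, u):
    """Toy rule 'Disease -> symptoms': the disease value sets symptom probabilities."""
    probs = {"migraine":  [0.9, 0.8, 0.1],   # headache, light sensitivity, congestion
             "sinusitis": [0.3, 0.1, 0.9]}[disease]
    symptoms = [p > noise for p, noise in zip(probs, u["u_symptoms"])]
    return {"disease": disease, "symptoms": symptoms}

u = sample_exogenous()
disease = "migraine" if u["u_disease"] < 0.5 else "sinusitis"
print(structural_equations(disease, u))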
Failed Attempts: Relying on human edits or free-form LLM edits can drift into a different mini-story (different style, different details), so you might accidentally change more than the one intended concept. That makes the comparison unfair.
The Gap: We needed a benchmark where counterfactuals come from the same data-generating recipe as the original, so we can truly measure the effect of a single concept change.
Real Stakes: In hiring, a trustworthy explanation can show that education, not gender, drove a recommendation. In health, it can show which symptoms mattered most. In workplace safety, it can surface how role and department affect risk. For people making real decisions, this difference between a faithful and an unfaithful explanation really matters.
Hook (new concept): Imagine asking, "If I only change the flour in a cake recipe, what changes in the cake?" You'd keep the oven, pan, and temperature the same.
The Concept (Structural counterfactuals): A structural counterfactual is a precise "what-if" created by the SCM while keeping the background randomness fixed.
- What it is: The output you get when you change one concept and rerun the same causal recipe with the same hidden conditions.
- How it works: (1) Fix the hidden randomness; (2) set the new value for the target concept; (3) let the SCM rules update everything downstream; (4) regenerate the text deterministically.
- Why it matters: Without this, we can't tell if changes in the output came from the concept change or from unrelated randomness.
Anchor: For a CV, keep the same writing style and persona (background), but change "education: bachelor's → master's," then see how the model's hiring score changes.
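A minimal sketch of this fix-change-rerun recipe, assuming a dictionary-based toy SCM and a `render` function that stands in for the deterministic LLM call; all names here (`scm`, `render`, the persona and template strings) are hypothetical and for illustration only.

```python
def scm(concepts, noise):
    """Toy downstream rule: education causally sets a derived 'seniority' concept."""
    c = dict(concepts)
    c["seniority"] = {"bachelor's": "junior", "master's": "mid-level"}[c["education"]]
    return c

def render(concepts, noise):
    """Stand-in for deterministic (temperature 0) LLM generation with a fixed persona/template."""
    return noise["template"].format(persona=noise["persona"], **concepts)

def structural_counterfactual(concepts, noise, target, new_value):
    # Step 1: keep the recorded hidden conditions (persona, template, noise) exactly as they were.
    # Step 2: overwrite only the target concept.
    cf = dict(concepts)
    cf[target] = new_value
    # Step 3: propagate through the SCM rules and regenerate the text deterministically.
    return render(scm(cf, noise), noise)

noise = {"persona": "a detail-oriented candidate",
         "template": "I am {persona} with a {education} degree, applying at {seniority} level."}
original = render(scm({"education": "bachelor's"}, noise), noise)
counterfactual = structural_counterfactual({"education": "bachelor's"}, noise,
                                           "education", "master's")
print(original)
print(counterfactual)
```

Because the persona, template, and noise are reused, the two printed texts differ only where the education change (and its downstream effects) require it.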
This paper's answer: LIBERTy. It explicitly builds SCMs for three realistic text tasks, uses deterministic LLM generation with fixed style and persona to keep exogenous factors constant, and produces structural counterfactual text pairs for rigorous, causal evaluation. It also adds a friendlier metric, order-faithfulness, that focuses on whether an explanation at least ranks concepts correctly, even if scores come on different scales.
02 Core Idea
Hook: Imagine testing a robot chef. If you want to know how much sugar matters, you don't ask a person to rewrite the recipe; you change only the sugar and keep the oven, pan, and steps the same.
The Concept (Aha!): The key insight is to generate counterfactual texts inside a causal recipe (an SCM) where the LLM is part of the process and exogenous things (like persona and template) stay fixed, so we can measure the true cause-and-effect of concept changes.
- What it is: A causal framework (LIBERTy) that creates structural counterfactual text pairs and uses them to benchmark concept-based explanations.
- How it works: (1) Define an SCM for the task; (2) plug in an LLM that writes the text deterministically from concept values; (3) add fixed persona and template to provide realistic style; (4) intervene on a concept; (5) regenerate the text under the same conditions; (6) compare model predictions to get the causal effect; (7) evaluate explanation methods using error distance and order-faithfulness.
- Why it matters: It aligns evaluation with the actual data-generating process, avoiding the drift and noise of human or free-form edits.
Anchor: In workplace violence prediction, switching department while keeping the same persona and template produces a text where only department-related content changes, letting us fairly measure its effect on the predicted risk.
Three analogies for the same idea:
- Cooking: Keep the oven, pan, and steps the same (persona, template, decoding). Change only one ingredient (the concept). Taste the difference (model prediction change). That's the causal effect.
- Theater: Same stage, same actors, same script structure (persona + template). Change one role's costume (concept value). Watch how the audience's reaction (model output) shifts.
- Lego City: Same city map and traffic rules (SCM). Move one building (concept). The traffic pattern (prediction) changes in a traceable, rule-driven way.
Hook (new concept): You know how a family ranks chores from most to least important when time is short?
The Concept (Order-faithfulness): A metric checking if an explanation gets the relative order of concept importance right.
- What it is: A score that says whether concept A should be above concept B and your method agrees, regardless of the raw scale.
- How it works: (1) Compute true effect sizes from structural counterfactuals; (2) get the method's importance scores; (3) compare pairwise orderings; (4) report the fraction of correct orderings.
- Why it matters: Different methods use different scales; ordering is often what humans and policies need first.
Anchor: If the gold ordering says "experience > education > volunteering," and your method says the same order, your explanation is order-faithful.
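A minimal sketch of this pairwise-ordering check, under the simplifying assumption that order-faithfulness is the fraction of concept pairs ranked in the same relative order as the gold effects (the benchmark's exact definition, e.g., its tie handling, may differ):

```python
from itertools import combinations

def order_faithfulness(gold_effects, method_scores):
    """Fraction of concept pairs whose relative order matches the gold effect sizes.

    Both arguments map concept name -> importance magnitude.
    """
    pairs = list(combinations(gold_effects, 2))
    agree = sum(
        (gold_effects[a] - gold_effects[b]) * (method_scores[a] - method_scores[b]) > 0
        for a, b in pairs
    )
    return agree / len(pairs)

gold = {"experience": 0.40, "education": 0.25, "volunteering": 0.05}
mine = {"experience": 0.90, "education": 0.60, "volunteering": 0.70}  # flips one pair
print(order_faithfulness(gold, mine))  # 2 of 3 pairs agree -> ~0.67
```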
Before vs. After:
- Before: Benchmarks relied on human-written counterfactuals or token-level proxies, often misaligned with the true causal process.
- After: LIBERTy ties everything to an explicit SCM and deterministic generation, so each counterfactual is a true "what-if," letting us test if explanations match real causal effects.
Why It Works (intuition):
- The SCM pins down who-causes-what, so interventions are clean.
- Deterministic decoding (temperature 0) plus fixed persona/template keeps the noise constant across original and counterfactual.
- Comparing model outputs across these paired texts gives the causal effect per input (ICaCE) and on average (CaCE).
Hook (new concept): Think of weather vs. climate: one day's rain (local) versus average patterns (global).
The Concept (CaCE and ICaCE): Measures of how much changing a concept shifts the model's outputs, globally and per-example.
- What it is: CaCE is the average effect across the dataset; ICaCE is the effect for one specific input pair.
- How it works: (1) Generate structural counterfactual pairs; (2) feed both into the model; (3) take the difference in predicted probabilities; (4) average for CaCE or keep per-instance for ICaCE.
- Why it matters: Without these, we can't score explanations against a true causal ground truth.
Anchor: For one CV, ICaCE shows how much switching bachelor's → master's changes the recommendation. Across many CVs, CaCE shows the average effect of education.
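A small sketch of both quantities, assuming each model output is a probability vector over the label classes and adopting the convention "counterfactual minus original" (the paper may define the sign or aggregation slightly differently):

```python
import numpy as np

def icace(probs_original, probs_counterfactual):
    """Per-example causal effect: difference between the model's two output distributions."""
    return np.asarray(probs_counterfactual) - np.asarray(probs_original)

def cace(pairs):
    """Global causal effect: average of the per-example ICaCE vectors."""
    return np.mean([icace(orig, cf) for orig, cf in pairs], axis=0)

# One CV pair: probabilities over {not recommended, potential hire, recommended},
# before and after the bachelor's -> master's intervention.
pair = ([0.22, 0.50, 0.28], [0.20, 0.40, 0.40])
print(icace(*pair))        # [-0.02, -0.10, +0.12]
print(cace([pair, pair]))  # averaging (here identical) pairs gives the same vector
```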
Building Blocks:
- SCMs with clear causal arrows and simple noise.
- An LLM that turns concept values into text deterministically.
- Persona and template as exogenous context to add realism without changing the causal story.
- Interventions that flip just one concept.
- Metrics: Error Distance (how close the explanation's vector is to the true effect) and Order-Faithfulness (does it rank concepts correctly?).
03 Methodology
High-level recipe: Concepts + Persona + Template → Deterministic LLM Text → Model Prediction → Intervene on One Concept → Regenerate Text (same persona/template) → New Prediction → Compute Causal Effect → Score Explanations.
Step-by-step (with why each step exists and what breaks without it):
- Define the SCM (the causal map and rules)
- What happens: Researchers draw a directed graph of concepts (e.g., age, department, symptoms), say how they influence each other, and add simple equations with noise terms. This becomes the data-generating process.
- Why it exists: The SCM tells us exactly how to create the original example and what must change after an intervention.
- What breaks without it: Counterfactuals become guesses, and we can't be sure we only changed the right thing.
- Example: In disease detection, Disease → Symptoms. Migraine raises chances of headache and light sensitivity; sinusitis raises nasal congestion and facial pain.
- Add exogenous grounding: Persona and Template
- What happens: Each text gets a fixed persona (style and background tidbits) and a template (narrative structure) sampled once and then held constant for the counterfactual.
- Why it exists: Deterministic decoding can create boring, repetitive texts if we don't supply rich context. Persona and template add realism and diversity while staying fixed across counterfactuals.
- What breaks without it: Either texts are too dull and samey, or if we use randomness, the counterfactual and original differ in style and content unrelated to the concept change.
- Example: A CV template that starts with a hook, presents education and experience, then closes with impact; a persona that mentions a family anecdote and a signature professional strength.
- Deterministic text generation (temperature = 0)
- What happens: The LLM (GPT-4o) turns concept values + persona + template into a single text with no sampling randomness.
- Why it exists: Ensures the only difference between original and counterfactual is the intervention on the concept.
- What breaks without it: If we sample, token-level randomness becomes untracked noise; we can't guarantee a structural counterfactual.
- Example: Given {department=ICU, age=44, gender=male} plus a persona and template, the LLM produces one fixed HR interview transcript.
- Structural counterfactual generation via Pearl's three steps
- What happens: Abduction (fix the exogenous randomness, including persona and template), Action (set the concept to a new value), Prediction (propagate through SCM and regenerate text deterministically).
- Why it exists: This ensures a true, single-change what-if.
- What breaks without it: Counterfactuals might drift into a new mini-story or hidden noise might change too, corrupting the measurement.
- Example: Change department from Psychiatric to ICU while keeping the same persona and template, then regenerate the HR interview.
- Train or prompt the explained models
- What happens: Five models are evaluated: three fine-tuned (DeBERTa-v3, T5-base, Qwen2.5-1.5B-Instruct) and two zero-shot LLMs (Llama-3.1-8B-Instruct, GPT-4o). They predict the target label (violence risk, disease, or CV quality) from the text.
- Why it exists: We need a model to explain.
- What breaks without it: No predictions to compare.
- Example: Given a CV text, the model outputs probabilities over {Not recommended, Potential hire, Recommended}.
- Compute ICaCE and CaCE
- What happens: For each original-counterfactual pair, subtract the model's output vectors to get ICaCE (per input). Average over many pairs to get CaCE (global effect).
- Why it exists: These are the causal references we'll compare explanations against.
- What breaks without it: No gold standard to decide if an explanation is faithful.
- Example: For one CV, moving bachelor's → master's changes the model's probabilities by [+0.12 recommended, -0.10 potential, -0.02 not]; that's the ICaCE vector.
- Run explanation methods and score them
- What happens: Evaluate multiple families:
- Counterfactual generation by LLM editing (prompting with causal hints)
- Matching (retrieve similar examples with the target concept value); the best is FT Match, using embeddings from a model fine-tuned on the dataset
- Concept erasure (LEACE) to remove linear concept info
- Concept attributions (TCAV + ConceptShap) for global importance
- Why it exists: We want to see which approach best reflects the true causal effects.
- What breaks without it: No way to compare methods or make progress.
- Example: For matching, find top-k candidate texts that are most similar but have the target concept value changed; compare their predictions to estimate effect.
- Metrics: Error Distance (ED) and Order-Faithfulness (OF)
- What happens: ED measures how close an explanation's effect vector is to the gold ICaCE vector (lower is better). OF measures whether the method gets the pairwise ordering of concept importances correct (higher is better). A code sketch of matching-based estimation and ED scoring follows this list.
- Why it exists: Methods output scores on different scales; OF focuses on ranking, which is robust and often what users need.
- What breaks without it: Apples-to-oranges comparisons; we may punish good rankings just because of scale.
- Example: If gold says education > experience > volunteering, and your method agrees across pairs, OF is high.
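To tie the matching and scoring steps together, here is a minimal sketch of an embedding-based matching estimator scored with an L2-style error distance. The retrieval scheme, the use of cosine similarity, the value of k, and the choice of norm are all assumptions for illustration; the benchmark's FT Match and ED definitions may differ in detail.

```python
import numpy as np

def match_estimate(query_emb, cand_embs, cand_probs, original_probs, k=5):
    """Matching-style ICaCE estimate: retrieve the k candidates most similar to the
    original text (all candidates already carry the intervened concept value) and
    compare their average prediction with the original prediction."""
    sims = cand_embs @ query_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    top_k = np.argsort(-sims)[:k]
    return cand_probs[top_k].mean(axis=0) - original_probs

def error_distance(estimated_effect, gold_icace):
    """Lower is better: distance between estimated and gold effect vectors
    (shown here as an L2 norm)."""
    return float(np.linalg.norm(estimated_effect - gold_icace))

rng = np.random.default_rng(0)
query = rng.normal(size=16)                   # embedding of the original text
cands = rng.normal(size=(100, 16))            # embeddings of candidate texts
probs = rng.dirichlet(np.ones(3), size=100)   # the model's predictions on those candidates
orig = np.array([0.2, 0.5, 0.3])
est = match_estimate(query, cands, probs, orig)
print(est, error_distance(est, np.array([-0.02, -0.10, 0.12])))
```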
The Secret Sauce:
- Causal grounding: The SCM anchors what counts as a valid counterfactual.
- Determinism + fixed persona/template: Keeps noise and style constant while still producing rich, human-like text.
- Order-Faithfulness: A practical, human-aligned yardstick that rewards getting the ranking right even when numeric scales differ.
04 Experiments & Results
The Test: The authors built three datasets, Workplace Violence (HR-nurse interviews), Disease Detection (forum self-reports), and CV Screening (personal statements), each with a richer causal graph than older benchmarks. They evaluated five explained models (three fine-tuned, two zero-shot LLMs) and eight explanation methods across four families. They measured two things: (1) how close explanations are to the true effects (ED), and (2) whether explanations rank concepts in the right order (OF).
The Competition: Prior work (like CEBaB) leaned on human-written counterfactuals and often found LLM-generated counterfactual edits to be strong. Here, methods competed under stricter, structural counterfactuals. Families included: LLM editing, matching (semantic and concept-based), concept erasure (LEACE), and concept attributions (TCAV + ConceptShap).
The Scoreboard (with context):
- Matching wins overall. In particular, FT Match, which uses embeddings from a model fine-tuned on the dataset's label, achieved the lowest error distances and the highest order-faithfulness on average. Think of it like getting an A-minus when most others got B's.
- LLM-generated counterfactual edits did not dominate here, unlike in CEBaB. Mimicking human edits isn't enough when the gold truth is the structural causal effect.
- The best local methods reached around 0.7 OF (on a 0 to 1 scale), meaning they ranked concepts correctly about 70% of the time; error distances hovered around 0.3 (0 is perfect), showing meaningful but incomplete accuracy.
- For global explanations (ranking concepts across the whole dataset), matching also came out on top in order-faithfulness.
Surprising Findings:
- Zero-shot LLMs (e.g., GPT-4o, Llama-3.1-8B) showed low sensitivity to demographic interventions (like race, gender, age). That likely reflects post-training alignment to be cautious around such attributes: good for safety, but it also means their predictions changed less than the ground-truth causal effects would suggest in the synthetic data.
- Among fine-tuned models, Qwen2.5-1.5B tracked the true causal effects more closely than others, but still didn't perfectly match them; vanilla fine-tuning isn't enough to fully absorb the causal structure.
- Every global method missed at least one of the top-3 gold-important concepts per dataset-model combo, signaling big headroom for better global explanations.
Takeaway: When we switch from human-aligned edits to true structural counterfactuals, the leaderboard flips. Methods that find near neighbors inside the same data-generating process (matching) beat methods that try to write their own counterfactual edits. And even the best current methods have lots of room to improve, especially on capturing fine-grained causal effects.
05 Discussion & Limitations
Limitations:
- Synthetic text: LIBERTy uses LLM-generated texts, which are not human-written. Human evaluation showed high quality and consistency, but nuances of human expression may still be missing.
- Concept-only focus: The benchmark centers on concept-based explanations, not token-level rationales or free-text explanations.
- Simplified causal worlds: The SCMs are plausible but simplified. They're not claims about reality, just controlled testbeds for fair evaluation.
Required Resources:
- Access to strong LLMs for text generation and to train/evaluate explained models (or prompts for zero-shot).
- Compute to generate datasets, fine-tune models, and run explanation methods (especially matching and attribution pipelines).
- Causal graph design time to specify SCMs and choose meaningful interventions.
When NOT to Use:
- If your application cannot accept synthetic data as a testbed, or you need explanations tied directly to real-world, human-authored corpora.
- If you only care about token-level clues and not higher-level human concepts.
- If stochastic generation (creative variety) is required in the counterfactual itself; structural counterfactuals demand deterministic decoding.
Open Questions:
- Can we design explanation methods that estimate causal effects directly with less dependence on retrieval or heavy fine-tuning?
- How can we build causal learning techniques so models better align with the SCM's structure (beyond vanilla fine-tuning)?
- Can we broaden beyond concept-based explanations to unify token-level, concept-level, and free-text rationales under one causal umbrella?
- How should we evaluate explanations when real data are partly LLM-generated and partly human-generated, a likely future scenario?
- Can controlled decoding or multi-sample averaging offer faithful structural counterfactuals without strict determinism?
06 Conclusion & Future Work
3-Sentence Summary: LIBERTy is a causal framework that builds structural counterfactual datasets using SCMs in which an LLM helps write realistic texts under fixed conditions, so we can measure true concept effects. With new datasets and a practical ranking metric (order-faithfulness), LIBERTy shows that matching-based methods, especially with fine-tuned embeddings, currently provide the most faithful concept-based explanations. The results also reveal that modern LLMs tend to be less sensitive to demographic changes, and that there is significant room to improve both local and global explanations.
Main Achievement: Turning counterfactual evaluation into a fully causal, structural process, aligning the benchmark's gold references with the actual data-generating procedure and making the tests scalable and rigorous.
Future Directions: Improve explanation methods that combine causal structure with learned representations; explore causal training that aligns models with the SCM; expand to richer tasks, mixed human/LLM corpora, and hybrid explanation types; and develop robust techniques that remain faithful under distribution shifts.
Why Remember This: LIBERTy reframes how we evaluate explanations, from mimicking human edits to measuring true causal effects, and provides a practical path to building AI systems whose reasoning about human concepts we can actually trust.
Practical Applications
- Auditing hiring models: Verify whether education, experience, and certifications truly drive recommendations over demographic attributes.
- Clinical triage support: Check if symptom-based models rely on the right clinical concepts when suggesting likely conditions.
- Workplace safety assessment: Understand which job-related factors (e.g., department or tenure) most influence predicted risk.
- Policy compliance checks: Use order-faithfulness to make sure models rank sensitive concepts appropriately (e.g., demographics not outweighing qualifications).
- Model selection: Compare explanation methods on LIBERTy to pick the most faithful one before deployment.
- Causal stress-testing: Intervene on key concepts and measure how sensitive each model is, detecting over- or under-sensitivity.
- Training improvement: Use gaps between model sensitivity and gold effects to motivate causal fine-tuning or data collection.
- Benchmark extension: Plug in new SCMs and domains (e.g., finance, education) to build task-specific, causal evaluation suites.
- Safety alignment evaluation: Quantify how alignment affects sensitivity to demographic concepts.
- Regulatory reporting: Provide structured, causal evidence of how concepts influence automated decisions.