LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
Key Summary
- This paper builds LIBERTy, a new way to fairly judge how well AI systems explain their decisions in terms of big, human ideas like age, race, or experience.
- Instead of asking people to hand-write counterfactual texts, LIBERTy uses a causal recipe (an SCM) where an LLM helps generate both the original text and its precise 'what-if' version.
- The framework creates three realistic datasets (disease detection, CV screening, and workplace violence prediction) with structural counterfactual text pairs.
- A new metric called order-faithfulness checks if an explanation at least ranks concepts in the right order, even if the scores use different scales.
- Across many models and methods, simple matching methods, especially those using embeddings from a fine-tuned model, were the most faithful.
- There is still big room for improvement: the best local methods score about 0.7 out of 1 on ordering and sit at an error of about 0.3, where 0 is perfect.
- Zero-shot LLMs like GPT-4o showed low sensitivity to demographic changes, likely because of safety and fairness alignment, while a fine-tuned small LLM (Qwen2.5-1.5B) tracked ground-truth effects better.
- LIBERTy is scalable, cheaper than human-written counterfactuals, and keeps the evaluation aligned with the actual data-generating process.
- This benchmark helps researchers build and compare more trustworthy, concept-based explanations for high-stakes AI decisions.
Why This Research Matters
When AI helps decide who gets hired, what illness someone might have, or how safe a workplace is, we must know which big ideas truly drove the decision. LIBERTy gives us a reliable way to check whether explanations about these ideas are faithful to cause-and-effect, not just good-looking guesses. It replaces costly human editing with a causal, scalable process that keeps the evaluation aligned with how the data were actually generated. This helps teams choose explanation methods that are trustworthy and practical for real decision-making. It also reveals where today's methods and models fall short, guiding future improvements. As more real-world data become LLM-generated, LIBERTy's approach becomes even more relevant.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how teachers want students to show their work, not just give answers? People want AIs to do that too, especially when the answers affect real lives, like hiring, healthcare, or safety.
The Concept: Concept-based explanations are about asking an AI, "How much did a big idea (like gender, experience, or a symptom) matter for your decision?"
- What it is: A way to measure the influence of human-understandable ideas on a model's prediction.
- How it works: (1) Pick a concept; (2) see how changing it would change the model's output; (3) report how big that change is.
- Why it matters: Without this, we only see token-level clues (like single words), not the big ideas people actually care about.
Anchor: If an AI reads a CV and says "Recommended," concept-based explanations tell you whether education or experience mattered more.
The World Before: AI could do many text tasks well, but explaining "why" in human terms was hard. A popular step forward used human-written counterfactuals: people edited a text to reflect a change (like making the author older) and we compared model predictions before vs. after. This powered the CEBaB benchmark, which was a big milestone but had limits.
The Problem: Human-written counterfactuals are costly, can be inconsistent, and don't come from the true cause-and-effect process that produced the original text. In other words, they're good approximations, but they are not the actual "what-if" from the same world with just one change.
Hook (new concept): Imagine a Lego city with rules for how roads connect and where houses can go. If you move one house, the traffic patterns change in predictable ways. You'd want your test to follow the city's rules.
The Concept (SCMs, Structural Causal Models): An SCM is a clear map of what causes what, plus simple equations that say how changes spread.
- What it is: A diagram with variables (like age, symptoms, or job quality) and arrows that show how one affects another, along with rules for generating data.
- How it works: (1) Define variables and arrows; (2) add noise terms to represent natural randomness; (3) use the rules to create examples.
- Why it matters: Without this map, we can't reliably create the exact "what-if" where only one thing changes.
Anchor: In a disease dataset, the SCM can say "Disease → symptoms," so changing the disease in a what-if should ripple into different symptoms.
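To make the "variables, arrows, noise, rules" recipe concrete, here is a minimal Python sketch of a toy SCM in the spirit of the disease example. The variable names, probabilities, and noise model are illustrative assumptions, not the benchmark's actual specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exogenous():
    """Draw the background noise once; it stays fixed when we later build counterfactuals."""
    return {"u_disease": rng.random(), "u_symptoms": rng.random(3)}

def structural_equations(disease, u):
    """Toy rule 'Disease -> symptoms': the disease value sets symptom probabilities."""
    probs = {"migraine":  [0.9, 0.8, 0.1],   # headache, light sensitivity, congestion
             "sinusitis": [0.3, 0.1, 0.9]}[disease]
    symptoms = [p > noise for p, noise in zip(probs, u["u_symptoms"])]
    return {"disease": disease, "symptoms": symptoms}

u = sample_exogenous()
disease = "migraine" if u["u_disease"] < 0.5 else "sinusitis"
print(structural_equations(disease, u))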
Failed Attempts: Relying on human edits or free-form LLM edits can drift into a different mini-story (different style, different details), so you might accidentally change more than the one intended concept. That makes the comparison unfair.
The Gap: We needed a benchmark where counterfactuals come from the same data-generating recipe as the original, so we can truly measure the effect of a single concept change.
Real Stakes: In hiring, a trustworthy explanation can show that education, not gender, drove a recommendation. In health, it can show which symptoms mattered most. In workplace safety, it can surface how role and department affect risk. For people making real decisions, this difference between a faithful and an unfaithful explanation really matters.
Hook (new concept): Imagine asking, "If I only change the flour in a cake recipe, what changes in the cake?" You'd keep the oven, pan, and temperature the same.
The Concept (Structural counterfactuals): A structural counterfactual is a precise "what-if" created by the SCM while keeping the background randomness fixed.
- What it is: The output you get when you change one concept and rerun the same causal recipe with the same hidden conditions.
- How it works: (1) Fix the hidden randomness; (2) set the new value for the target concept; (3) let the SCM rules update everything downstream; (4) regenerate the text deterministically.
- Why it matters: Without this, we can't tell if changes in the output came from the concept change or from unrelated randomness.
Anchor: For a CV, keep the same writing style and persona (background), but change "education: bachelor's → master's," then see how the model's hiring score changes.
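A minimal sketch of this fix-change-rerun recipe, assuming a dictionary-based toy SCM and a `render` function that stands in for the deterministic LLM call; all names here (`scm`, `render`, the persona and template strings) are hypothetical and for illustration only.

```python
def scm(concepts, noise):
    """Toy downstream rule: education causally sets a derived 'seniority' concept."""
    c = dict(concepts)
    c["seniority"] = {"bachelor's": "junior", "master's": "mid-level"}[c["education"]]
    return c

def render(concepts, noise):
    """Stand-in for deterministic (temperature 0) LLM generation with a fixed persona/template."""
    return noise["template"].format(persona=noise["persona"], **concepts)

def structural_counterfactual(concepts, noise, target, new_value):
    # Step 1: keep the recorded hidden conditions (persona, template, noise) exactly as they were.
    # Step 2: overwrite only the target concept.
    cf = dict(concepts)
    cf[target] = new_value
    # Step 3: propagate through the SCM rules and regenerate the text deterministically.
    return render(scm(cf, noise), noise)

noise = {"persona": "a detail-oriented candidate",
         "template": "I am {persona} with a {education} degree, applying at {seniority} level."}
original = render(scm({"education": "bachelor's"}, noise), noise)
counterfactual = structural_counterfactual({"education": "bachelor's"}, noise,
                                           "education", "master's")
print(original)
print(counterfactual)
```

Because the persona, template, and noise are reused, the two printed texts differ only where the education change (and its downstream effects) require it.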
This paper's answer: LIBERTy. It explicitly builds SCMs for three realistic text tasks, uses deterministic LLM generation with fixed style and persona to keep exogenous factors constant, and produces structural counterfactual text pairs for rigorous, causal evaluation. It also adds a friendlier metric, order-faithfulness, that focuses on whether an explanation at least ranks concepts correctly, even if scores come on different scales.
02 Core Idea
Hook: Imagine testing a robot chef. If you want to know how much sugar matters, you don't ask a person to rewrite the recipe; you change only the sugar and keep the oven, pan, and steps the same.
The Concept (Aha!): The key insight is to generate counterfactual texts inside a causal recipe (an SCM) where the LLM is part of the process and exogenous things (like persona and template) stay fixed, so we can measure the true cause-and-effect of concept changes.
- What it is: A causal framework (LIBERTy) that creates structural counterfactual text pairs and uses them to benchmark concept-based explanations.
- How it works: (1) Define an SCM for the task; (2) plug in an LLM that writes the text deterministically from concept values; (3) add fixed persona and template to provide realistic style; (4) intervene on a concept; (5) regenerate the text under the same conditions; (6) compare model predictions to get the causal effect; (7) evaluate explanation methods using error distance and order-faithfulness.
- Why it matters: It aligns evaluation with the actual data-generating process, avoiding the drift and noise of human or free-form edits.
Anchor: In workplace violence prediction, switching department while keeping the same persona and template produces a text where only department-related content changes, letting us fairly measure its effect on the predicted risk.
Three analogies for the same idea:
- Cooking: Keep the oven, pan, and steps the same (persona, template, decoding). Change only one ingredient (the concept). Taste the difference (model prediction change). That's the causal effect.
- Theater: Same stage, same actors, same script structure (persona + template). Change one role's costume (concept value). Watch how the audience's reaction (model output) shifts.
- Lego City: Same city map and traffic rules (SCM). Move one building (concept). The traffic pattern (prediction) changes in a traceable, rule-driven way.
Hook (new concept): You know how a family ranks chores from most to least important when time is short?
The Concept (Order-faithfulness): A metric checking if an explanation gets the relative order of concept importance right.
- What it is: A score that says whether concept A should be above concept B and your method agrees, regardless of the raw scale.
- How it works: (1) Compute true effect sizes from structural counterfactuals; (2) get the method's importance scores; (3) compare pairwise orderings; (4) report the fraction of correct orderings.
- Why it matters: Different methods use different scales; ordering is often what humans and policies need first.
Anchor: If the gold ordering says "experience > education > volunteering," and your method says the same order, your explanation is order-faithful.
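A minimal sketch of this pairwise-ordering check, under the simplifying assumption that order-faithfulness is the fraction of concept pairs ranked in the same relative order as the gold effects (the benchmark's exact definition, e.g., its tie handling, may differ):

```python
from itertools import combinations

def order_faithfulness(gold_effects, method_scores):
    """Fraction of concept pairs whose relative order matches the gold effect sizes.

    Both arguments map concept name -> importance magnitude.
    """
    pairs = list(combinations(gold_effects, 2))
    agree = sum(
        (gold_effects[a] - gold_effects[b]) * (method_scores[a] - method_scores[b]) > 0
        for a, b in pairs
    )
    return agree / len(pairs)

gold = {"experience": 0.40, "education": 0.25, "volunteering": 0.05}
mine = {"experience": 0.90, "education": 0.60, "volunteering": 0.70}  # flips one pair
print(order_faithfulness(gold, mine))  # 2 of 3 pairs agree -> ~0.67
```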
Before vs. After:
- Before: Benchmarks relied on human-written counterfactuals or token-level proxies, often misaligned with the true causal process.
- After: LIBERTy ties everything to an explicit SCM and deterministic generation, so each counterfactual is a true "what-if," letting us test if explanations match real causal effects.
Why It Works (intuition):
- The SCM pins down who-causes-what, so interventions are clean.
- Deterministic decoding (temperature 0) plus fixed persona/template keeps the noise constant across original and counterfactual.
- Comparing model outputs across these paired texts gives the causal effect per input (ICaCE) and on average (CaCE).
Hook (new concept): Think of weather vs. climate: one day's rain (local) versus average patterns (global).
The Concept (CaCE and ICaCE): Measures of how much changing a concept shifts the model's outputs, globally and per-example.
- What it is: CaCE is the average effect across the dataset; ICaCE is the effect for one specific input pair.
- How it works: (1) Generate structural counterfactual pairs; (2) feed both into the model; (3) take the difference in predicted probabilities; (4) average for CaCE or keep per-instance for ICaCE.
- Why it matters: Without these, we can't score explanations against a true causal ground truth.
Anchor: For one CV, ICaCE shows how much switching bachelor's → master's changes the recommendation. Across many CVs, CaCE shows the average effect of education.
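A small sketch of both quantities, assuming each model output is a probability vector over the label classes and adopting the convention "counterfactual minus original" (the paper may define the sign or aggregation slightly differently):

```python
import numpy as np

def icace(probs_original, probs_counterfactual):
    """Per-example causal effect: difference between the model's two output distributions."""
    return np.asarray(probs_counterfactual) - np.asarray(probs_original)

def cace(pairs):
    """Global causal effect: average of the per-example ICaCE vectors."""
    return np.mean([icace(orig, cf) for orig, cf in pairs], axis=0)

# One CV pair: probabilities over {not recommended, potential hire, recommended},
# before and after the bachelor's -> master's intervention.
pair = ([0.22, 0.50, 0.28], [0.20, 0.40, 0.40])
print(icace(*pair))        # [-0.02, -0.10, +0.12]
print(cace([pair, pair]))  # averaging (here identical) pairs gives the same vector
```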
Building Blocks:
- SCMs with clear causal arrows and simple noise.
- An LLM that turns concept values into text deterministically.
- Persona and template as exogenous context to add realism without changing the causal story.
- Interventions that flip just one concept.
- Metrics: Error Distance (how close the explanation's vector is to the true effect) and Order-Faithfulness (does it rank concepts correctly?).
03 Methodology
High-level recipe: Concepts + Persona + Template → Deterministic LLM Text → Model Prediction → Intervene on One Concept → Regenerate Text (same persona/template) → New Prediction → Compute Causal Effect → Score Explanations.
Step-by-step (with why each step exists and what breaks without it):
- Define the SCM (the causal map and rules)
- What happens: Researchers draw a directed graph of concepts (e.g., age, department, symptoms), say how they influence each other, and add simple equations with noise terms. This becomes the data-generating process.
- Why it exists: The SCM tells us exactly how to create the original example and what must change after an intervention.
- What breaks without it: Counterfactuals become guesses, and we can't be sure we only changed the right thing.
- Example: In disease detection, Disease → Symptoms. Migraine raises chances of headache and light sensitivity; sinusitis raises nasal congestion and facial pain.
- Add exogenous grounding: Persona and Template
- What happens: Each text gets a fixed persona (style and background tidbits) and a template (narrative structure) sampled once and then held constant for the counterfactual.
- Why it exists: Deterministic decoding can create boring, repetitive texts if we don't supply rich context. Persona and template add realism and diversity while staying fixed across counterfactuals.
- What breaks without it: Either texts are too dull and samey, or if we use randomness, the counterfactual and original differ in style and content unrelated to the concept change.
- Example: A CV template that starts with a hook, presents education and experience, then closes with impact; a persona that mentions a family anecdote and a signature professional strength.
- Deterministic text generation (temperature = 0)
- What happens: The LLM (GPT-4o) turns concept values + persona + template into a single text with no sampling randomness.
- Why it exists: Ensures the only difference between original and counterfactual is the intervention on the concept.
- What breaks without it: If we sample, token-level randomness becomes untracked noise; we can't guarantee a structural counterfactual.
- Example: Given {department=ICU, age=44, gender=male} plus a persona and template, the LLM produces one fixed HR interview transcript.
- Structural counterfactual generation via Pearl's three steps
- What happens: Abduction (fix the exogenous randomness, including persona and template), Action (set the concept to a new value), Prediction (propagate through SCM and regenerate text deterministically).
- Why it exists: This ensures a true, single-change what-if.
- What breaks without it: Counterfactuals might drift into a new mini-story or hidden noise might change too, corrupting the measurement.
- Example: Change department from Psychiatric to ICU while keeping the same persona and template, then regenerate the HR interview.
- Train or prompt the explained models
- What happens: Five models are evaluated: three fine-tuned (DeBERTa-v3, T5-base, Qwen2.5-1.5B-Instruct) and two zero-shot LLMs (Llama-3.1-8B-Instruct, GPT-4o). They predict the target label (violence risk, disease, or CV quality) from the text.
- Why it exists: We need a model to explain.
- What breaks without it: No predictions to compare.
- Example: Given a CV text, the model outputs probabilities over {Not recommended, Potential hire, Recommended}.
- Compute ICaCE and CaCE
- What happens: For each original-counterfactual pair, subtract the model's output vectors to get ICaCE (per input). Average over many pairs to get CaCE (global effect).
- Why it exists: These are the causal references we'll compare explanations against.
- What breaks without it: No gold standard to decide if an explanation is faithful.
- Example: For one CV, moving bachelor's → master's changes the model's probabilities by [+0.12 recommended, -0.10 potential, -0.02 not]; that's the ICaCE vector.
- Run explanation methods and score them
- What happens: Evaluate multiple families:
- Counterfactual generation by LLM editing (prompting with causal hints)
- Matching (retrieve similar examples with the target concept value); the best is FT Match, using embeddings from a model fine-tuned on the dataset
- Concept erasure (LEACE) to remove linear concept info
- Concept attributions (TCAV + ConceptShap) for global importance
- Why it exists: We want to see which approach best reflects the true causal effects.
- What breaks without it: No way to compare methods or make progress.
- Example: For matching, find top-k candidate texts that are most similar but have the target concept value changed; compare their predictions to estimate effect.
- Metrics: Error Distance (ED) and Order-Faithfulness (OF)
- What happens: ED measures how close an explanation's effect vector is to the gold ICaCE vector (lower is better). OF measures whether the method gets the pairwise ordering of concept importances correct (higher is better). A code sketch of matching-based estimation and ED scoring follows this list.
- Why it exists: Methods output scores on different scales; OF focuses on ranking, which is robust and often what users need.
- What breaks without it: Apples-to-oranges comparisons; we may punish good rankings just because of scale.
- Example: If gold says education > experience > volunteering, and your method agrees across pairs, OF is high.
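To tie the matching and scoring steps together, here is a minimal sketch of an embedding-based matching estimator scored with an L2-style error distance. The retrieval scheme, the use of cosine similarity, the value of k, and the choice of norm are all assumptions for illustration; the benchmark's FT Match and ED definitions may differ in detail.

```python
import numpy as np

def match_estimate(query_emb, cand_embs, cand_probs, original_probs, k=5):
    """Matching-style ICaCE estimate: retrieve the k candidates most similar to the
    original text (all candidates already carry the intervened concept value) and
    compare their average prediction with the original prediction."""
    sims = cand_embs @ query_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    top_k = np.argsort(-sims)[:k]
    return cand_probs[top_k].mean(axis=0) - original_probs

def error_distance(estimated_effect, gold_icace):
    """Lower is better: distance between estimated and gold effect vectors
    (shown here as an L2 norm)."""
    return float(np.linalg.norm(estimated_effect - gold_icace))

rng = np.random.default_rng(0)
query = rng.normal(size=16)                   # embedding of the original text
cands = rng.normal(size=(100, 16))            # embeddings of candidate texts
probs = rng.dirichlet(np.ones(3), size=100)   # the model's predictions on those candidates
orig = np.array([0.2, 0.5, 0.3])
est = match_estimate(query, cands, probs, orig)
print(est, error_distance(est, np.array([-0.02, -0.10, 0.12])))
```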
The Secret Sauce:
- Causal grounding: The SCM anchors what counts as a valid counterfactual.
- Determinism + fixed persona/template: Keeps noise and style constant while still producing rich, human-like text.
- Order-Faithfulness: A practical, human-aligned yardstick that rewards getting the ranking right even when numeric scales differ.
04 Experiments & Results
The Test: The authors built three datasets, Workplace Violence (HR-nurse interviews), Disease Detection (forum self-reports), and CV Screening (personal statements), each with a richer causal graph than older benchmarks. They evaluated five explained models (three fine-tuned, two zero-shot LLMs) and eight explanation methods across four families. They measured two things: (1) how close explanations are to the true effects (ED), and (2) whether explanations rank concepts in the right order (OF).
The Competition: Prior work (like CEBaB) leaned on human-written counterfactuals and often found LLM-generated counterfactual edits to be strong. Here, methods competed under stricter, structural counterfactuals. Families included: LLM editing, matching (semantic and concept-based), concept erasure (LEACE), and concept attributions (TCAV + ConceptShap).
The Scoreboard (with context):
- Matching wins overall. In particular, FT Match, which uses embeddings from a model fine-tuned on the dataset's label, achieved the lowest error distances and the highest order-faithfulness on average. Think of it like getting an A-minus when most others got B's.
- LLM-generated counterfactual edits did not dominate here, unlike in CEBaB. Mimicking human edits isn't enough when the gold truth is the structural causal effect.
- The best local methods reached around 0.7 OF (on a 0 to 1 scale), meaning they ranked concepts correctly about 70% of the time; error distances hovered around 0.3 (0 is perfect), showing meaningful but incomplete accuracy.
- For global explanations (ranking concepts across the whole dataset), matching also came out on top in order-faithfulness.
Surprising Findings:
- Zero-shot LLMs (e.g., GPT-4o, Llama-3.1-8B) showed low sensitivity to demographic interventions (like race, gender, age). That likely reflects post-training alignment to be cautious around such attributes: good for safety, but it also means their predictions changed less than the ground-truth causal effects would suggest in the synthetic data.
- Among fine-tuned models, Qwen2.5-1.5B tracked the true causal effects more closely than others, but still didn't perfectly match them; vanilla fine-tuning isn't enough to fully absorb the causal structure.
- Every global method missed at least one of the top-3 gold-important concepts per dataset-model combo, signaling big headroom for better global explanations.
Takeaway: When we switch from human-aligned edits to true structural counterfactuals, the leaderboard flips. Methods that find near neighbors inside the same data-generating process (matching) beat methods that try to write their own counterfactual edits. And even the best current methods have lots of room to improve, especially on capturing fine-grained causal effects.
05 Discussion & Limitations
Limitations:
- Synthetic text: LIBERTy uses LLM-generated texts, which are not human-written. Human evaluation showed high quality and consistency, but nuances of human expression may still be missing.
- Concept-only focus: The benchmark centers on concept-based explanations, not token-level rationales or free-text explanations.
- Simplified causal worlds: The SCMs are plausible but simplified. They're not claims about reality, just controlled testbeds for fair evaluation.
Required Resources:
- Access to strong LLMs for text generation and to train/evaluate explained models (or prompts for zero-shot).
- Compute to generate datasets, fine-tune models, and run explanation methods (especially matching and attribution pipelines).
- Causal graph design time to specify SCMs and choose meaningful interventions.
When NOT to Use:
- If your application cannot accept synthetic data as a testbed, or you need explanations tied directly to real-world, human-authored corpora.
- If you only care about token-level clues and not higher-level human concepts.
- If stochastic generation (creative variety) is required in the counterfactual itself; structural counterfactuals demand deterministic decoding.
Open Questions:
- Can we design explanation methods that estimate causal effects directly with less dependence on retrieval or heavy fine-tuning?
- How can we build causal learning techniques so models better align with the SCM's structure (beyond vanilla fine-tuning)?
- Can we broaden beyond concept-based explanations to unify token-level, concept-level, and free-text rationales under one causal umbrella?
- How should we evaluate explanations when real data are partly LLM-generated and partly human-generated, a likely future scenario?
- Can controlled decoding or multi-sample averaging offer faithful structural counterfactuals without strict determinism?
06 Conclusion & Future Work
3-Sentence Summary: LIBERTy is a causal framework that builds structural counterfactual datasets using SCMs in which an LLM helps write realistic texts under fixed conditions, so we can measure true concept effects. With new datasets and a practical ranking metric (order-faithfulness), LIBERTy shows that matching-based methods, especially with fine-tuned embeddings, currently provide the most faithful concept-based explanations. The results also reveal that modern LLMs tend to be less sensitive to demographic changes, and that there is significant room to improve both local and global explanations.
Main Achievement: Turning counterfactual evaluation into a fully causal, structural process, aligning the benchmark's gold references with the actual data-generating procedure and making the tests scalable and rigorous.
Future Directions: Improve explanation methods that combine causal structure with learned representations; explore causal training that aligns models with the SCM; expand to richer tasks, mixed human/LLM corpora, and hybrid explanation types; and develop robust techniques that remain faithful under distribution shifts.
Why Remember This: LIBERTy reframes how we evaluate explanations, from mimicking human edits to measuring true causal effects, and provides a practical path to building AI systems whose reasoning about human concepts we can actually trust.
Practical Applications
- Auditing hiring models: Verify whether education, experience, and certifications truly drive recommendations over demographic attributes.
- Clinical triage support: Check if symptom-based models rely on the right clinical concepts when suggesting likely conditions.
- Workplace safety assessment: Understand which job-related factors (e.g., department or tenure) most influence predicted risk.
- Policy compliance checks: Use order-faithfulness to make sure models rank sensitive concepts appropriately (e.g., demographics not outweighing qualifications).
- Model selection: Compare explanation methods on LIBERTy to pick the most faithful one before deployment.
- Causal stress-testing: Intervene on key concepts and measure how sensitive each model is, detecting over- or under-sensitivity.
- Training improvement: Use gaps between model sensitivity and gold effects to motivate causal fine-tuning or data collection.
- Benchmark extension: Plug in new SCMs and domains (e.g., finance, education) to build task-specific, causal evaluation suites.
- Safety alignment evaluation: Quantify how alignment affects sensitivity to demographic concepts.
- Regulatory reporting: Provide structured, causal evidence of how concepts influence automated decisions.