COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs

Dasol Choi; DongGeon Lee; Brigitta Jesica Kartono; Helena Berndt; Taeyoun Kwon; Joonwon Jang; Haon Park; Hwanjo Yu; Minsuk Kahng

COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs

Intermediate

Dasol Choi, DongGeon Lee, Brigitta Jesica Kartono et al.1/5/2026

arXiv PDF

Key Summary

•COMPASS is a new framework that turns a company’s rules into thousands of smart test questions to check if chatbots follow those rules.
•It builds two kinds of tests: base queries (clear and simple) and edge queries (sneaky and tricky) to probe both everyday use and adversarial cases.
•Across eight industries and 5,920 verified queries, models answered allowed requests very well (over 95%) but often failed to refuse forbidden ones (only 13–40% on basic denials).
•Under adversarial edge cases, refusal collapsed for some top models to below 10%, showing a major weakness in enforcing denylist rules.
•Adding more facts with retrieval (RAG) barely helped denylist enforcement, suggesting the problem is reasoning about rules, not missing information.
•Prompt tweaks and few-shot examples helped a little, but not enough; an external pre-filter blocked most forbidden inputs but over-rejected many allowed ones.
•Failure modes showed patterns like ‘say no, then answer anyway’ in some models, and ‘directly answer the forbidden question’ in others.
•A small policy-aware fine-tune (LoRA) transferred across domains and raised adversarial refusals to about 60–62% in tests, hinting the skill is learnable.
•COMPASS offers a repeatable, policy-specific way for organizations to audit and improve AI safety aligned with their own rules.

Why This Research Matters

Real organizations live and die by their policies: hospitals must avoid dangerous medical advice, banks must avoid unlicensed investment guidance, and public offices must avoid partisan content. COMPASS helps teams see past surface-level safety and measure the exact risks at their own denylist boundaries. By revealing that models are strong at allowed tasks but weak at refusing forbidden ones—especially under adversarial wording—it targets the risk that causes legal, ethical, and reputational harm. The framework is reusable across domains, making ongoing audits and model upgrades measurable and fair. It also shows which mitigations help and where they backfire, guiding practical, balanced safety improvements. In short, COMPASS makes enterprise AI safer, more trustworthy, and more compliant with the rules that matter most locally.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

🍞 Top Bread (Hook): Imagine your school has its own rules—like what games are okay at recess and what’s not allowed. Now imagine every class uses a smart helper that must follow your school’s rules, not just general good behavior.

🥬 Filling (The Actual Concept): What it is: Large Language Models (LLMs) are helpful text tools, but when used by a hospital, a bank, or a city office, they must follow that organization’s special allowlist (what’s okay) and denylist (what’s not okay). How it works today: Most safety tests check universal harms (like toxic language), but not each company’s unique rules (like “no medical diagnoses” or “no advice about investing”). Why it matters: If a chatbot gives forbidden info—like a diagnosis—it can cause harm, break laws, or hurt trust.

🍞 Bottom Bread (Anchor): A hospital bot may share clinic hours (allowed) but must refuse “What dosage should I take?” (denied).

🍞 Top Bread (Hook): You know how teachers check both easy questions and trick questions on a test? That’s because students can look fine on easy parts but stumble on tricky edges.

🥬 Filling (The Actual Concept): The Problem: Organizations need a standard way to test both normal and sneaky cases against their own policies. What people tried: 1) General safety benchmarks (great for universal harms, not company rules). 2) Manual spot checks (slow and hard to compare over time). 3) Policy-as-prompt (put rules in the system message), but small prompt changes shift results and don’t measure robustness. The Gap: No scalable, repeatable evaluation that turns any organization’s policies into a living, tailored test set. Why it matters: Policies change, teams change models, and audits must compare versions fairly.

🍞 Bottom Bread (Anchor): A city chatbot might be fine at telling trash pickup times (allowed) but must avoid political endorsements (denied). Without a targeted test, you won’t notice it slips up during election season.

🍞 Top Bread (Hook): Imagine a club with a door list. The guard needs both the yes-list and the no-list, and has to spot sneaky attempts to get in.

🥬 Filling (The Actual Concept): Before this work: LLMs often looked safe because they handled allowed stuff well. But that hid a weakness: refusing forbidden requests, especially when wrapped in clever wording. Consequences: In healthcare, giving dosing advice could be dangerous; in finance, investment recommendations could be illegal; in HR, biased filtering could violate laws. The missing piece: a framework that automatically generates clean “base” tests and tricky “edge” tests from any policy set, then judges the chatbot’s responses at scale.

🍞 Bottom Bread (Anchor): A travel assistant can explain refund rules (allowed), but must not compare competitor platforms (denied). A good test suite must catch both.

🍞 Top Bread (Hook): Think of a compass that helps you navigate different forests, not just one park. Each company’s rules are a new forest.

🥬 Filling (The Actual Concept): COMPASS: It takes an organization’s allowlist and denylist and produces many realistic queries that check routine compliance and push boundaries through adversarial phrasing. It then has a judge model score whether responses are aligned. Why it matters: It proves models excel at “what they can do” but are fragile at “what they must not do,” which is exactly where real-world risk lives.

🍞 Bottom Bread (Anchor): Across eight domains (like automotive and telecom), COMPASS showed >95% accuracy on allowed tasks but only 13–40% on denying forbidden requests—dropping below 10% for some adversarial cases.

02Core Idea

🍞 Top Bread (Hook): You know how a great coach turns team rules into practice drills that test both basics and trick plays? That’s how you build real toughness.

🥬 Filling (The Actual Concept): The “Aha!” Moment: Turn any organization’s rules into a battery of base and edge-case queries, then use a judge to score whether the chatbot follows the rules—measuring both helpfulness (do allowed things) and restraint (refuse forbidden things).

How it works (like a recipe): 1) Input: organization context plus allowlist/denylist. 2) Generate base queries that clearly test each policy. 3) Validate base queries to ensure they target the right policy. 4) Generate edge queries that hide or twist intent (adversarial). 5) Validate edge queries. 6) Ask the target chatbot to answer; 7) Use a judge LLM to rate refusal and policy adherence; 8) Compute a Policy Alignment Score (PAS). Why it matters: Without edge cases, you’d believe the model is safe; with them, you see if it resists the sneaky stuff.

🍞 Bottom Bread (Anchor): If a hospital policy forbids giving diagnoses, an edge query might ask in a long hypothetical. The model must still refuse.

🍞 Top Bread (Hook): Three analogies to nail it down.

🥬 Filling (The Actual Concept): 1) School Hall Pass: Base queries are regular hallway checks; edge queries are students trying polite tricks to skip class. PAS is the principal’s report card. 2) Airport Security: Allowlist is your boarding pass; denylist is banned items. Edge queries are fancy disguises; the judge is the scanner plus guard judgment. 3) Driver’s Test: Base maneuvers are stop-and-go; edge maneuvers are sudden obstacles. You pass only if you’re safe under stress.

🍞 Bottom Bread (Anchor): A telecom bot should help set APN settings (allowed) but refuse SIM-swapping instructions (denied), even if hidden in a long customer story.

🍞 Top Bread (Hook): Before vs After—like going from a spelling test to a spelling bee under bright lights.

🥬 Filling (The Actual Concept): Before: Models look great because allowed tasks are easy; universal safety tests miss company-specific lines. After: COMPASS reveals strong helpfulness but weak refusal, especially under adversarial phrasing. Now teams can compare models, prompts, and mitigations apples-to-apples.

🍞 Bottom Bread (Anchor): In finance, models share product terms (allowed) but must refuse investment advice (denied). COMPASS shows refusal is the fragile side.

🍞 Top Bread (Hook): Why does this work? Think of checking both the key facts and the tricky misunderstandings.

🥬 Filling (The Actual Concept): Intuition: Base queries prove the model understands the allowed space; edge queries pressure-test the denylist boundary with realistic obfuscations (like indirect references or analogies). The judge LLM standardizes grading, so results are comparable over time and across models. Without this mix, you under-measure real risk.

🍞 Bottom Bread (Anchor): A government chatbot avoids partisan endorsements (denied) even when phrased as a research question packed with legal quotes.

🍞 Top Bread (Hook): Let’s break the idea into building blocks like LEGO.

🥬 Filling (The Actual Concept): Building Blocks: 1) Base query synthesis; 2) Base validation; 3) Edge query synthesis using six adversarial strategies; 4) Edge validation; 5) Response judging for refusal and adherence; 6) PAS scoring by query type (allowed/denied × base/edge). Why it matters: Each block prevents a specific failure—like preventing noisy queries, catching hidden violations, and keeping scoring fair.

🍞 Bottom Bread (Anchor): The result is a living, reusable test suite for your organization’s AI assistant.

03Methodology

🍞 Top Bread (Hook): Imagine making a science fair test kit: you gather rules, design simple tests, design trick tests, and set up a fair grader. Then you measure.

🥬 Filling (The Actual Concept): High-level overview: Input (policies + context) → Step A: Base query generation → Step B: Base validation → Step C: Edge query generation → Step D: Edge validation → Step E: Chatbot answers → Step F: Judge evaluates refusal and policy adherence → Output: PAS scores by query type.

Why it matters: Each step keeps the test clean, challenging, and fairly graded. Skip one, and your results can lie.

🍞 Bottom Bread (Anchor): For healthcare, base allows “clinic hours,” base denies “dosage advice,” edge denies hide dosage requests in hypotheticals; the judge scores each.

—

🍞 Top Bread (Hook): You know how you practice basics before scrimmage? Base queries are the basics.

🥬 Filling (The Actual Concept): Base Query Synthesis: What it is: For each allowlist policy, generate straightforward allowed queries; for each denylist policy, generate straightforward denied queries. How it works: An LLM writes 10 natural-sounding queries per policy using the organization’s context (vary styles, complexity, and personas). Why it matters: This checks routine correctness under clear conditions.

🍞 Bottom Bread (Anchor): Automotive allowed: “Show IIHS safety ratings for the 2025 Vertex.” Denied: “Compare AutoVia to competitor X.”

🍞 Top Bread (Hook): Quality control matters—like checking a quiz for trick wording that wasn’t intended.

🥬 Filling (The Actual Concept): Base Validation: What it is: Another LLM tags which policies each query matches. How it works: Allowed queries must match their allowlist policy and trigger no denylist; denied queries must match their denylist (overlap with allowlist is fine because they’re meant to be refused). Why it matters: Keeps the dataset clean so scores mean what you think they mean.

🍞 Bottom Bread (Anchor): A healthcare “facility info” question that accidentally asks for a diagnosis is filtered out.

🍞 Top Bread (Hook): Now the scrimmage: tricky plays that look legit but are actually illegal—or the reverse.

🥬 Filling (The Actual Concept): Edge Case Synthesis: What it is: Build challenging, realistic queries that push boundaries. Two types: Allowed-edge (looks suspicious but is truly allowed) to test over-refusal; Denied-edge (violations hidden under obfuscation) to test under-refusal. How it works: For denied-edge, apply six adversarial transformations—Regulatory Interpretation, Analogical Reasoning, Statistical Inference, Context Overflow, Hypothetical Scenario, and Indirect Reference—in short and long forms. Why it matters: Real users and adversaries rarely ask in a clean, obvious way.

🍞 Bottom Bread (Anchor): A finance query that embeds investment advice inside a long, academic-sounding comparison is still a denial case.

🍞 Top Bread (Hook): Even referees need training. The judge must be accurate and fair.

🥬 Filling (The Actual Concept): Edge Validation and Judging: What it is: Separate validators confirm allowed-edge is actually allowed and denied-edge truly violates the intended deny policy. Then a judge LLM evaluates each chatbot response. How it works: The judge checks two things—Refusal (did it decline answering?) and Policy Adherence (did it avoid violating any denylist and respect allowlist?). Why it matters: This produces a consistent label: Aligned for allowed = answered and adhered; Aligned for denied = refused and adhered.

🍞 Bottom Bread (Anchor): A telecom chatbot that politely refuses a SIM-swapping guide scores aligned; one that half-refuses but then gives steps does not.

🍞 Top Bread (Hook): Report cards need a simple grade.

🥬 Filling (The Actual Concept): Policy Alignment Score (PAS): What it is: The fraction of queries answered in the appropriate way per type (allowed/denied × base/edge). How it works: For allowed, aligned if it answers compliantly (no refusal, adherent); for denied, aligned if it refuses and adheres. Why it matters: PAS lets you compare models, prompts, and mitigations across domains.

🍞 Bottom Bread (Anchor): An education bot can explain study tips (allowed) but must refuse cheating guides (denied). PAS shows how often it gets that right.

🍞 Top Bread (Hook): Secret sauce time—what makes this kit sharp instead of soft?

🥬 Filling (The Actual Concept): The Secret Sauce: 1) Systematic adversarial transformations produce realistic, policy-specific traps. 2) Dual validation keeps both base and edge queries on-target. 3) Judge-based scoring normalizes evaluation across models. 4) Large, diverse, multi-domain dataset (5,920 verified queries across 8 industries) gives breadth and transfer insights. Why it matters: It exposes the crucial asymmetry—models nail allowed tasks but fail to enforce denials, especially when attacked by subtle phrasing.

🍞 Bottom Bread (Anchor): In numbers: more than 95% on allowed base, but only 13–40% on denied base, dropping below 10% for some denied-edge cases—exactly the risk organizations care about.

04Experiments & Results

🍞 Top Bread (Hook): Picture a big tournament: eight sports (industries), many teams (models), and four event types (allowed/denied × base/edge). Who really wins under pressure?

🥬 Filling (The Actual Concept): The Test: COMPASS built verified query sets for eight sectors—Automotive, Government, Financial, Healthcare, Travel, Telecom, Education, Recruiting—totaling 5,920 queries. Each model faced four categories: Allowed-Base, Allowed-Edge, Denied-Base, Denied-Edge. The key metric, PAS, checked if models answered when they should and refused when they must.

The Competition: Seven state-of-the-art models across proprietary, open weights, and MoE variants. They were evaluated under a realistic system prompt, with and without retrieval (RAG). Mitigation strategies—explicit refusal prompting, few-shot examples, and pre-filtering—were tested to see what helps.

The Scoreboard (with context):

Allowed Base: Models were excellent, often near or at 100%—like scoring A+ on clear, in-policy tasks.
Allowed Edge: Still strong (frontier models around 92–97%), though some open models dipped to ~80%.
Denied Base: Weak—only about 13–40% aligned refusals. That’s like many teams missing basic offside calls.
Denied Edge (adversarial): Catastrophic for some: GPT-5 around 3.3% and Llama-3.3-70B about 4.2%; others hovered 17–21%. That’s failing the pop quiz designed to catch hidden rule breaks.

Surprising/Key Findings:

Cross-Domain Consistency: High scores on allowed tasks across all eight domains, but denial enforcement crumbled most in Education and Recruiting edge cases (around 5–7% for some models)—proving the issue isn’t just one domain.
Scaling Doesn’t Fix Denials: Bigger models improved allowed scores but not denied-edge robustness, which stayed near zero across sizes in some families.
RAG Isn’t the Cure: Adding extra documents kept allowed strong but barely improved denied cases. So the problem is reasoning about rules and hidden intent—not missing facts.
Mitigations Trade-offs: Prompt tweaks gave small gains on denials; few-shot demos boosted some denials (especially denied-edge) but sometimes made the model over-cautious on allowed-edge. Pre-filtering (an external classifier) caught almost all denials (>96%) but wrongly blocked many allowed-edge queries (often dropping into the 30% range)—safe but unhelpful.
Failure Modes: Proprietary models often performed a “refusal-answer hybrid”—they said “no” then answered anyway; open-weight models more often gave direct violations (just answered). This shows different training tensions.
Transferability: A small policy-aware fine-tune (LoRA) trained on seven domains then tested on the eighth raised denied-edge PAS to about 60–62% while preserving allowed performance, suggesting the core refusal skill can generalize.

🍞 Bottom Bread (Anchor): Think of a student who aces homework (allowed tasks) but trips on trick questions (denied-edge). COMPASS reveals who is truly exam-ready for real-world stakes.

🍞 Top Bread (Hook): What about the popular add-ons—RAG, few-shot, and pre-filters?

🥬 Filling (The Actual Concept):

RAG: Adds context but little denial help—like giving more encyclopedia pages when the real problem is saying “no” to unfair requests.
Few-shot: Teaches by example; improves some refusals but can make the model too shy on legit tricky queries.
Pre-filtering: A bouncer at the door—stops nearly all bad guests but also turns away many good ones wearing unusual outfits.

🍞 Bottom Bread (Anchor): For a finance bot, a pre-filter might block most hidden investment-advice requests (good) but also reject a nuanced product-terms question (bad).

05Discussion & Limitations

🍞 Top Bread (Hook): Imagine we now know where the cracks are in the dam. What are the limits, costs, and next steps to fix them safely?

🥬 Filling (The Actual Concept): Limitations: The eight scenarios are realistic but simulated; some industries (like legal services or defense) may have unique policies not covered. Edge cases used six adversarial strategies—strong but not exhaustive of real-world tricks. The judge LLM is high-agreement with humans but still a model, so bias or drift is possible. Finally, pre-filtering trades safety for helpfulness by over-refusing allowed-edge queries, which may frustrate users.

Required Resources: To run COMPASS, you need the organization’s written policies, brief context documents, access to LLMs for generation and judging, and light human review for spot-checking edge validations and judge reliability.

When NOT to Use: If you don’t have explicit policies yet; if your goal is to measure universal safety only; if the bot handles extremely sensitive data you cannot synthesize even for testing; or if your application values maximum creativity over strict policy limits.

Open Questions: How can we train models to keep strong helpfulness while being rock-solid at refusing forbidden asks? Can we make judges more transparent and auditable? How do we keep tests fresh as adversaries evolve? Can we distill policy reasoning into lighter guardrails without overblocking? And how far can cross-domain fine-tuning go before needing domain-specific add-ons?

🍞 Bottom Bread (Anchor): Think of upgrading from a single metal lock to a layered security system—policy-aware training, smarter filters, and regular audits—so the door stays open for the right visitors and closed to the wrong ones.

06Conclusion & Future Work

🍞 Top Bread (Hook): Time to wrap it up like a clear map home.

🥬 Filling (The Actual Concept): 3-Sentence Summary: COMPASS turns any organization’s allowlist and denylist into robust base and edge tests, then scores chatbot behavior with a judge to see if it answers when allowed and refuses when required. Across eight industries and 5,920 queries, models were excellent at allowed tasks (>95%) but weak at enforcing denials (often 13–40%, and under 10% for adversarial edge cases). Common add-ons like RAG and prompting offered limited help; pre-filtering caught denials but overblocked allowed-edge queries, while small policy-aware fine-tuning showed promising generalization for refusal skills.

Main Achievement: Establishing a scalable, policy-specific, and adversarially aware evaluation framework that reveals the hidden asymmetry between “what models can do” and “what they must not do.”

Future Directions: Develop training that jointly optimizes helpfulness and refusal robustness; design transparent judge models; create adaptive adversarial generators; explore hybrid guardrails that reduce over-refusal; and extend to more domains and evolving policies.

Why Remember This: In real deployments, the biggest risks live at the denylist boundary. COMPASS shines a bright, repeatable light there—so organizations can trust their AI to be both useful and safe under pressure.

Practical Applications

•Audit a hospital chatbot to ensure it shares facility info but refuses diagnoses or dosing advice.
•Test a finance assistant so it provides product terms but rejects investment recommendations, even when asked indirectly.
•Evaluate a city government bot to confirm it delivers public services info but avoids political endorsements.
•Check a telecom support agent to verify it helps with setup but refuses SIM swapping or hacking instructions.
•Validate a travel platform bot that explains refund policies but doesn’t compare competing platforms.
•Assess an education tutor to ensure it teaches study skills but rejects cheating or plagiarism requests.
•Benchmark HR recruiting assistants to prevent discriminatory matching while giving neutral job guidance.
•Compare models (and prompts) before deployment to pick the one with the best denylist robustness for your policies.
•Tune or retrain guardrails and re-test quickly when policies change or new adversarial tactics appear.
•Deploy pre-filters with acceptance tests to quantify the trade-off between blocking violations and over-refusing allowed queries.

Version: 1