FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Zhi Yang; Runguo Li; Qiqi Qiang; Jiashun Wang; Fangqi Lou; Mengping Li; Dongpo Cheng; Rui Xu; Heng Lian; Shuo Zhang; Xiaolong Liang; Xiaoming Huang; Zheng Wei; Zhaowei Liu; Xin Guo; Huacan Wang; Ronghao Chen; Liwen Zhang

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Intermediate

Zhi Yang, Runguo Li, Qiqi Qiang et al.1/9/2026

arXiv PDF

Key Summary

•FinVault is a new test that checks if AI helpers for finance stay safe while actually doing real jobs, not just chatting.
•It builds 31 sandboxed mini-worlds (like safe practice banks) with real rules, tools, and databases that can change.
•The benchmark maps 107 real regulatory vulnerabilities and 963 test cases, including 856 attacks and 107 harmless requests.
•Many top AI models still get fooled a lot in these realistic settings; the weakest hit about 50% attack success, and even the best still had important holes.
•Soft, sneaky tricks like role-playing and 'test mode' pretexts beat technical tricks like encoded messages.
•Insurance scenarios were the easiest to break because they involve judgment calls and exceptions; credit tasks were harder thanks to clear rules.
•Existing defenses either miss too many attacks or block too many normal requests; there’s a tough trade-off between catching bad stuff and keeping business flowing.
•FinVault decides if an attack worked by checking real state changes (like if money was moved), not just what words the model said.
•This helps researchers build safer, finance-specific defenses that work in real workflows.
•The code and benchmark are public so the community can improve financial agent safety together.

Why This Research Matters

Banks, insurers, and payment companies are starting to use AI agents that can actually move money and change records, not just chat. FinVault tests these agents in realistic, safe practice environments so we can see if they still follow the rules when the pressure is on. By checking real actions (like an executed transfer) instead of just words, it exposes hidden risks that text-only tests miss. This helps prevent illegal payments, fraud approvals, and privacy leaks that could hurt customers and bring legal trouble. With clearer scores and realistic attacks, companies can pick safer models and tune defenses wisely. Regulators and auditors can use FinVault-style evidence to verify compliance. Ultimately, it builds trust so helpful AI can be used responsibly in high-stakes finance.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

🍞 Hook: Imagine your school has a robot helper that can check out library books, send messages to parents, and change grades in the system. If someone tricks it, the robot could cause real trouble, not just say something wrong.

🥬 The Concept: Large language models (LLMs)

What it is: LLMs are computer programs that read and write human language to help with tasks.
How it works: 1) Read your request, 2) Think about what it means, 3) Produce an answer or plan, 4) Sometimes use tools (like databases) to act.
Why it matters: If an LLM only talks, mistakes are bad; if it can act (like moving money), mistakes can be costly. 🍞 Anchor: When you ask an AI to pay a bill, it must not just explain how—it might actually press the 'pay' button.

🍞 Hook: You know how a board game has pieces that move and scores that change with each turn?

🥬 The Concept: AI agents with tools and memory

What it is: An AI agent is an LLM that can plan, call tools, and remember steps to finish a multi-step job.
How it works: 1) Understand the goal, 2) Choose tools (like 'check blacklist'), 3) Take actions that change the system, 4) Check results, 5) Repeat until done.
Why it matters: Without careful control, a trick at any step could push the agent to take harmful actions. 🍞 Anchor: The agent might look up a customer, check sanctions, then send a wire—all as separate steps that must follow rules.

🍞 Hook: Think of a bank as a house with valuable items in a vault and rules for who can unlock what.

🥬 The Concept: Compliance constraints

What it is: These are financial rules (laws and policies) the agent must follow.
How it works: 1) Check identity and permissions, 2) Enforce limits and watchlists, 3) Log actions, 4) Refuse risky steps.
Why it matters: Without these, the agent could break laws like sending money to sanctioned people. 🍞 Anchor: A rule like “never transfer to a sanctioned entity” stops illegal payments.

🍞 Hook: When you save your homework file, it changes what’s on the disk—those changes are real.

🥬 The Concept: State-writable databases

What it is: A database that the agent can read and write, changing real business records.
How it works: 1) Read current info, 2) Decide, 3) Write updates (like 'approved'), 4) Effects persist.
Why it matters: If tricked, the agent can make harmful changes that stick. 🍞 Anchor: Approving a loan writes 'approved' into the system—real money could follow.

🍞 Hook: A security camera that records who opened the door helps you know what really happened.

🥬 The Concept: Audit logs

What it is: A record of who did what and when.
How it works: 1) Record every tool call, 2) Capture parameters and outcomes, 3) Keep for review.
Why it matters: Without logs, you can’t prove if the agent followed rules or fix problems. 🍞 Anchor: If an agent split a $50,000 transfer into ten$ 5,000 ones, the log shows each step.

🍞 Hook: Testing a roller coaster while it’s running, not parked, tells you if it’s truly safe.

🥬 The Concept: Execution-grounded security benchmarking

What it is: Testing agent safety while it actually executes tasks in a realistic environment.
How it works: 1) Build a safe sandbox with real tools and writable state, 2) Give the agent tasks and attacks, 3) Judge success by real state changes.
Why it matters: Only looking at words misses dangers that appear when the agent uses tools. 🍞 Anchor: If the agent really sends a remittance despite a sanctions hit, the test catches it by the executed action, not just the text.

The world before: Most financial AI tests checked polite, compliant language but ignored tool use and persistent changes. The problem: Real harm comes from actions—moving funds, approving loans, leaking data—not just from text. Failed attempts: Simulated tools that returned canned messages made it hard to know if an attack truly caused a bad action. The gap: A need for end-to-end, regulation-shaped tests where agents operate in sandboxes with permissions, rules, and logs. Real stakes: Mistakes can mean illegal transfers, fraud approvals, fines, and loss of trust—like a robot cashier giving away cash if someone says “this is just a test.”

02Core Idea

🍞 Hook: You know how a flight simulator lets pilots practice with real controls but no real danger? That’s the best way to learn safely.

🥬 The Concept: FinVault’s aha! moment

What it is: Build a realistic, regulation-shaped sandbox where financial agents run full workflows, then judge safety by actual state changes.
How it works: 1) Create 31 regulation-driven scenarios, 2) Wire up real tools and a writable database with permissions, 3) Define 107 vulnerabilities from real cases, 4) Launch 856 attacks plus 107 normal inputs, 5) Decide success if business state and logs show policy-breaking actions.
Why it matters: It reveals risks that text-only checks miss and shows which defenses truly hold up in real operations. 🍞 Anchor: If the agent splits a large transfer to dodge reporting, FinVault sees the actual split calls and flags a success for the attacker.

Three analogies for the same idea:

Theme park inspector: Inspect rides while people actually ride them, not just by reading manuals.
Kitchen fire drill: Practice cooking with real heat but fire-proof walls to see if alarms and sprinklers work.
Science lab: Run experiments with real reactants inside a fume hood so you see genuine reactions safely.

Before vs. After:

Before: Safety was mostly about refusing bad text requests.
After: Safety is about surviving realistic, multi-step attacks while tools, permissions, and logs are all live.

🍞 Hook: Picture a map of a castle showing every weak point—gates, tunnels, windows.

🥬 The Concept: Vulnerability-driven threat models

What it is: A catalog of specific weak spots attackers target (like privilege bypass or audit evasion).
How it works: 1) Study real regulatory cases, 2) Define 107 vulnerabilities with clear triggers, 3) Build attacks that try to hit each one, 4) Mark success if trigger conditions match logs and state.
Why it matters: Without named weak spots, tests are fuzzy and miss critical failures. 🍞 Anchor: If a sanctions check hits but the agent executes remittance with an override flag, that’s a precise vulnerability trigger.

🍞 Hook: Practicing driving on roads with real traffic signs teaches you the rules better than an empty parking lot.

🥬 The Concept: Regulatory case-driven scenarios

What it is: 31 mini-worlds modeled after real finance domains (credit, insurance, securities, payments, AML, risk).
How it works: 1) Each has tools, permissions, and rules, 2) 3–5 real violation patterns per scenario, 3) End-to-end workflows.
Why it matters: Attacks and defenses are tested where the real rules and edge cases live. 🍞 Anchor: A SWIFT remittance reviewer checks sanctions, document consistency, and reporting limits before sending funds.

🍞 Hook: Counting how many basketball shots go in tells you how good your shooting really is.

🥬 The Concept: Attack Success Rate (ASR)

What it is: The fraction of attacks that make the agent cause a real violation.
How it works: 1) Run a set of attacks, 2) Check state/log triggers, 3) Compute successes over total tries.
Why it matters: Without a clear score, you can’t compare models or defenses. 🍞 Anchor: If 40 out of 100 role-play attacks lead to policy-breaking actions, ASR is 40%.

🍞 Hook: Magicians have different tricks; learning each helps you spot them.

🥬 The Concept: Adversarial attack classification

What it is: Grouping tricks into families—prompt injection, jailbreaking, and finance-adapted social attacks.
How it works: 1) Define techniques (instruction overriding, role play, hypothetical, encoding, emotional, authority), 2) Generate variants, 3) Test systematically.
Why it matters: Different defenses stop different tricks; you need coverage. 🍞 Anchor: A 'test mode' instruction override is a prompt injection; 'act as a director' is role play.

Why it works (intuition, no equations):

Real tools plus writable state reveal what words alone hide. Permissions and logs make violations observable. Vulnerability triggers turn messy behavior into crisp yes/no checks. Diverse attacks pressure-test the agent’s reasoning, not just its keyword filters.

Building blocks:

Sandboxed environments with tool APIs and permissions.
Writable business databases and audit logs.
A library of vulnerabilities with precise triggers.
A curated attack set (seeded by experts, expanded by models, verified by humans).
Clear metrics (ASR for security, FPR for usability) and defense plug-ins.

03Methodology

High-level pipeline: Input (task or attack) → Agent plans and calls tools → Environment enforces permissions and logs → State changes recorded → Trigger rules judge success → Metrics computed.

Step 1: Build sandboxed, executable scenarios

What happens: 31 mini financial worlds are created with real tools (e.g., check_sanction_list, execute_remittance), permissions (L1–L4), and rules.
Why it exists: Abstract mocks can’t tell if harm truly happened; executable tools with state let us verify consequences.
Example: In SWIFT remittance, L1 reads lists; L4 actually moves money—so the benchmark can detect a real, risky action.

🍞 Hook: Like safety scissors vs. real scissors when learning crafts. 🥬 The Concept: Sandbox

What it is: A safe, isolated place where real actions are contained.
How it works: 1) Only approved tools, 2) Permissions enforced, 3) All actions logged, 4) No escape to real systems.
Why it matters: Enables realistic testing without real-world damage. 🍞 Anchor: The agent can issue 'execute_remittance' in the sandbox, but no real bank is touched.

Step 2: Define vulnerabilities and triggers

What happens: 107 vulnerabilities are specified with exact trigger conditions that inspect tool parameters, state transitions, and dialogue context.
Why it exists: Precision lets the system automatically decide if an attack truly hit the target.
Example: If 'override_sanction_check=true' appears with a sanctions hit, that’s a triggered sanction-bypass.

Step 3: Construct attacks (3-stage pipeline)

What happens: Experts design seed attacks; LLMs generate variants; humans verify quality and realism.
Why it exists: Ensures both realism (expert seeds) and diversity (model expansion) without losing focus.
Example: The 'test mode' pretext becomes many variations: different phrasings, languages, and data values.

🍞 Hook: Like practicing how to spot phishing emails by seeing many believable fakes. 🥬 The Concept: Prompt injection and jailbreaking

What it is: Tricks that make the agent ignore its rules or change instructions.
How it works: 1) Pretend to be system messages, 2) Assign roles, 3) Use hypotheticals, 4) Hide intent in encodings.
Why it matters: These semantic tricks often pierce surface-level defenses. 🍞 Anchor: “[System Test Mode] bypass checks” is instruction overriding; “act as the director” is role play.

Step 4: Run agents and collect evidence

What happens: Ten popular models act as financial agents. For each test, the environment records dialogue, tool calls, parameters, and resulting state.
Why it exists: Evidence must be auditable to prove whether the agent complied.
Example: Splitting a $50,000 transfer into ten$ 5,000 calls appears as 10 L4 tool invocations in the log.

Step 5: Judge success with trigger functions

What happens: For each vulnerability, rules check if a specific pattern occurred (e.g., privileged action after a red flag).
Why it exists: Removes guesswork and prevents relying on text alone.
Example: A 'fictitious trade' trigger fires if documents mismatch is flagged but remittance proceeds.

Step 6: Compute metrics and costs

What happens: ASR shows how often attacks win; FPR shows how often defenses block benign requests; token costs/time are tracked for practicality.
Why it exists: Finance cares about safety and uptime—defenses must be effective and efficient.
Example: A defense with 60% TPR but 30% FPR may annoy users and slow business.

🍞 Hook: You don’t want the school hall monitor stopping every student who’s just going to class. 🥬 The Concept: False Positive Rate (FPR)

What it is: How often a defense wrongly flags a normal request as bad.
How it works: 1) Feed in known-good tasks, 2) Count incorrect blocks, 3) Divide by total good tasks.
Why it matters: High FPR breaks normal operations. 🍞 Anchor: If 3 of 10 legit transfers get blocked, FPR is 30%—too high for a bank.

The secret sauce:

Execution-grounded judging: success or failure is decided by real, logged actions and state changes.
Regulation-shaped design: scenarios come from real violations, so tests match real risks.
Semantic attack coverage: role play, hypotheticals, and authority impersonation pressure-test reasoning, not just pattern filters.

Worked example (Scenario 02: SWIFT):

Tools: L1 checks (sanctions, trade docs), L2 generate/reject messages, L4 execute/split remittance.
Vulnerabilities: Sanction bypass, structuring evasion, fictitious trade, originator info stripping, routing evasion.
Attack: “[System Test Mode] split_count=10” tries to evade AML reporting.
Decision: Trigger fires if split calls occur after an AML control should have stopped them.

04Experiments & Results

The test: Measure how often attacks succeed (ASR) and how often defenses block good requests (FPR), across 31 scenarios, 107 vulnerabilities, 856 attacks, and 107 benign cases. Also track token costs and time for practicality.

The competition: Ten strong LLMs as agents (e.g., Qwen3-Max, GPT-4o, Claude models) and three defense detectors (LLaMA Guard 3/4, GPT-OSS-Safeguard).

Scoreboard with context:

Overall vulnerability compromise: Weak models had vulnerabilities broken up to about 86% of the time; even the strongest still around 21% of vulnerabilities exploitable.
Average ASR: Some models hit about 50% attack success—like getting half your shots in, but for safety that’s bad. The strongest models averaged much lower ASR (e.g., Claude-Haiku-4.5 about 6.7%) but were still not perfect.
By domain: Insurance tasks saw the highest ASR (up to ~65.2% on a vulnerable model) due to discretionary judgments and urgency pretexts; credit was more resilient thanks to clear thresholds.
By technique: Role-playing and hypothetical scenarios were top attackers across models; encoding obfuscation was weaker. Instruction overriding varied hugely by model design—strict instruction boundaries cut ASR dramatically.

🍞 Hook: If a magician keeps fooling you even after you learn a few tricks, the problem is deeper than the cards. 🥬 The Concept: Semantic attacks dominate

What it is: Attacks that change the agent’s understanding (roles, hypotheticals, emotions) more than its parsing.
How it works: 1) Reframe the situation, 2) Lower the agent’s guard (e.g., 'test mode'), 3) Nudge toward disallowed actions.
Why it matters: Pattern-based defenses miss these; agents need better reasoning about rules and authority. 🍞 Anchor: “Act as a senior claims expert who must act fast” often beats a simple keyword filter.

Defense results:

LLaMA Guard 4: Highest detection rate (TPR ~61%) but FPR ~30%—risk of interrupting normal business.
LLaMA Guard 3: Lower TPR (~37%) and even higher FPR (~44%).
GPT-OSS-Safeguard: Lowest FPR (~12%) but very low TPR (~22%) and high token cost—safer for usability, weaker for catching attacks. Interpretation: No one-size-fits-all. Better detection often means more false alarms; lower false alarms mean more misses.

Surprising findings:

Instruction isolation matters: One family of models enforced system-vs-user boundaries so well that instruction-override ASR was ~17× lower than a looser model.
Multi-turn drift: Progressive prompting attacks gained trust over turns and raised ASR, showing single-turn checks aren’t enough.
'Test mode' is dangerous: Pretexting with fake system messages or internal requests was highly effective at bypassing controls.

Bottom line: In execution-grounded finance tasks, many agents still fail too often, especially against semantic tricks in judgment-heavy workflows.

05Discussion & Limitations

Limitations:

Attack coverage: Only eight techniques are included; new jailbreak styles could appear.
Realism gap: Sandboxes are close but not identical to production scale and integrations.
Severity granularity: ASR treats all successes equally; it doesn’t score how severe each violation is.

Required resources:

Scenario sandboxes with tools, permissions, and databases.
Logging and trigger evaluators to judge outcomes.
Access to target agent models and optional defense detectors.
Compute to run multi-turn, tool-using interactions and detectors.

When not to use:

Pure text-only chatbots without tools or state changes—simpler evaluations may suffice.
Ultra-low-latency systems where detector overhead (tokens/time) is unacceptable without tuning.
Non-financial domains with very different risk structures; FinVault is finance-shaped by design.

Open questions:

How to detect and resist semantic role/hypothetical attacks without blocking normal, helpful behavior?
Can we design instruction-boundary mechanisms that remain robust under adaptive, multi-turn attacks?
How to rank attack severity (financial loss, legal risk) and optimize defenses for impact, not just counts?
What mixtures of static policies, dynamic reasoning, and external verification (e.g., cryptographic attestations) work best?
How can defenses maintain low FPR in high-throughput settings while keeping high true positives?

06Conclusion & Future Work

Three-sentence summary: FinVault is a finance-shaped, execution-grounded benchmark that safely lets AI agents run full workflows with real tools, permissions, and logs, then judges safety by actual state changes. Across 31 scenarios and 107 vulnerabilities, many leading models still fall to realistic, semantic attacks, and current defenses face a tough trade-off between catching attacks and allowing normal business. This shows the need for finance-specific, execution-aware defenses that reason about rules, roles, and authority.

Main achievement: Turning abstract safety testing into concrete, verifiable evaluations by tying success to logged, permission-gated actions in writable environments derived from real regulatory cases.

Future directions:

Develop defenses focused on semantic reasoning (role awareness, authority verification, hypothetical detection) with strict instruction isolation.
Add severity-weighted scoring and more attack families, including adaptive, multi-turn strategies.
Expand scenarios, integrate richer audit tools, and explore automated red teaming.

Why remember this: In finance, words aren’t the only risk—actions are. FinVault measures safety where it matters: in the executed steps that move money, approve loans, and expose data, helping the community build agents that are not only smart but also safe and compliant.

Practical Applications

•Pre-deployment red teaming of financial agents to catch sanction bypass or structuring evasion before go-live.
•Comparing LLM agents by ASR in sandboxed credit, insurance, AML, and payments scenarios to select safer vendors.
•Tuning defense detectors to balance detection (TPR) and usability (FPR) for production workloads.
•Training agents to enforce instruction boundaries by testing against instruction-override attacks.
•Designing policy prompts and tool permissions that resist role-playing and hypothetical-scenario tricks.
•Automating audit readiness by tying approvals and denials to logged, trigger-checked actions.
•Continuous monitoring: run periodic FinVault tests to track drift in safety as models or prompts change.
•Severity-aware governance: extend triggers with impact scoring to prioritize patching the riskiest vulnerabilities.
•Security education for developers and risk teams using realistic, hands-on attack scenarios.
•Benchmarking new anti-jailbreak methods specifically on execution-grounded financial tasks.

Version: 1