
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

Intermediate
Haoyu Dong, Pengkun Zhang, Yan Gao et al. Ā· 12/15/2025
arXiv Ā· PDF

Key Summary

  • FINCH is a new test that checks whether AI can handle real finance and accounting work using messy, real spreadsheets, emails, PDFs, charts, and more.
  • It builds 172 multi-step workflows from real companies (like Enron) with 1,710 spreadsheets and millions of cells, so the challenges feel like real office life.
  • Workflows mix tasks such as data entry, cross-file lookups, calculations, modeling, validation, visualization, translation, and reporting.
  • Experts spent 700+ hours cleaning, checking, and writing clear instructions and reference answers so each workflow is fair and realistic.
  • FINCH judges AI in two ways: carefully by humans and automatically by an LLM judge that compares changes, screenshots, and formulas instead of just raw text.
  • Even top AI agents struggled: GPT 5.1 Pro passed 38.4% of workflows (about 17 minutes each), and Claude Sonnet 4.5 passed 25.0%.
  • AI especially fails when steps pile up (long workflows), when layouts are irregular, or when formulas encode hidden business logic.
  • The benchmark reveals what’s still missing for office-ready AI: reliable cross-sheet retrieval, formula reasoning, structure-aware edits, and robust multimodal grounding.
  • FINCH gives researchers a tough, realistic playground to build stronger, safer finance agents for real enterprises.

Why This Research Matters

Real businesses live in spreadsheets, emails, PDFs, and charts, not tidy toy tables, so AI must succeed in that world to be truly useful. FINCH shows exactly where today’s systems fail—like cross-file lookups, tricky formulas, and layout-preserving edits—so teams can fix what matters most. Better agents mean fewer expensive errors in budgets, audits, and reports, saving time and reducing risk. The benchmark’s human-plus-LLM judging mirrors how professionals check work, improving trust in AI results. As companies adopt AI co-pilots, FINCH helps choose the right tools and train better ones, speeding up safe automation. Over time, this can upgrade everyday office work, from month-end closes to investment analyses.

Detailed Explanation

01 Background & Problem Definition

šŸž Hook: Imagine your class is planning a school fair. You track money in a big spreadsheet, get rules from a PDF, share updates by email, and make a chart to show sales. That’s a lot to juggle, and everything has to match perfectly.

🄬 The Situation (The World Before): For years, AI tools were great at small, clean tasks—answering short questions, filling simple tables, or fixing tidy spreadsheets. But real finance and accounting (F&A) work inside companies is nothing like a tidy worksheet. It’s messy, long, and spread across many files: huge multi-sheet workbooks with weird layouts, emails that explain goals, PDFs full of tables and charts, and changing versions where people tweak formulas and fix mistakes over time. Professionals don’t do just one step—they stitch lots of steps into a workflow: import data → clean and structure it → pull values from other files → calculate with formulas → make charts → write a report.

šŸž Anchor: Think of FINCH as bringing the entire school fair project—notes, messy sheets, charts, and all—into one challenge so we can see if an AI helper can actually run the fair end to end.

— Prerequisite Concepts (taught first, using the Sandwich pattern) —

šŸž Hook: You know how you use a grid notebook to track chores and allowance? Rows, columns, and boxes keep things organized. 🄬 Spreadsheets: A spreadsheet is a grid of cells for text, numbers, and formulas.

  • How it works: 1) Put data in cells, 2) Use formulas (like SUM) to compute results, 3) Link cells across sheets so updates ripple correctly.
  • Why it matters: Without spreadsheets, tracking money, totals, and changes becomes chaotic and error-prone. šŸž Anchor: Your allowance sheet adds up weekly amounts and shows a running total.

šŸž Hook: Ever read a class handbook as a PDF with tables and pictures? 🄬 PDFs: A PDF is a document format that preserves layout exactly, including text, tables, and images.

  • How it works: 1) Pack content with fonts and positions, 2) Keep the look the same on any device, 3) Note that tables are often hard to copy out perfectly.
  • Why it matters: If an AI can’t read PDFs well, it misses rules, numbers, or context needed for correct answers. šŸž Anchor: A school trip PDF has a fee table; you must copy it accurately into your budget sheet.

šŸž Hook: When you text a friend, words carry meaning beyond letters. 🄬 Text: Text is written language that explains goals, steps, or notes.

  • How it works: 1) Combine words into instructions, 2) Use context to decide what to do, 3) Match terms to data fields.
  • Why it matters: If an AI misreads instructions, it will do the wrong task even if the numbers are right. šŸž Anchor: An email says ā€œUpdate the 2002 allocationsā€; that tells you which sheet and year to edit.

šŸž Hook: Solving a jigsaw puzzle takes multiple moves, not just one. 🄬 Multi-step Reasoning: Multi-step reasoning means solving big problems by chaining smaller steps in the right order.

  • How it works: 1) Plan steps, 2) Execute each step, 3) Check and fix mistakes, 4) Continue until the goal is met.
  • Why it matters: Without it, the AI might get stuck after the first step or combine steps incorrectly. šŸž Anchor: Import data, then clean it, then calculate totals, then graph results—each depends on the last.

šŸž Hook: When two teachers grade the same test, you want them to agree. 🄬 Human Evaluation: Human evaluation is when trained people check if the AI truly did the requested job.

  • How it works: 1) Read instruction, 2) Compare input, reference, and AI output side by side, 3) Decide pass/fail based on correctness and completeness.
  • Why it matters: Automatic checks can miss subtle layout, logic, or formula errors; humans ensure fairness. šŸž Anchor: A judge confirms the AI changed only what the instruction asked and didn’t break other formulas.

The Problem: Before FINCH, most tests were too clean: small tables, one file, one step, no tricky layouts, and no cross-file logic. That meant models looked great on paper but stumbled in the office, where spreadsheets are giant, references span many sheets, formulas hide business rules, and instructions live in emails or PDFs.

Failed Attempts: Prior benchmarks focused on narrow skills—like text question-answering, simple table math, or one-shot formula writing. They rarely captured version histories (how files change), mixed media (PDFs + spreadsheets + charts), or the need to avoid over-editing (don’t break unrelated cells!). As a result, models trained or tested this way didn’t learn to survive real-world messiness.

The Gap: We needed a benchmark made from real enterprise artifacts, preserving their size, structure, noise, and multi-step nature—plus careful judging that respects multiple correct solutions (e.g., different but equivalent formulas).

Real Stakes: In F&A, errors cost money and trust. A wrong cross-sheet link can flip profit to loss. A broken formula can hide risk. Messy imports or mistranslations can derail audits. Getting this right means safer budgets, clearer reports, and fewer late nights fixing spreadsheets.

šŸž Bottom Bread: FINCH is like a realistic dress rehearsal for the school fair—if the AI can run the whole messy event in practice, you can trust it more on the big day.

02 Core Idea

šŸž Hook: Think of a relay race where each runner depends on the last. If any runner drops the baton, the whole team loses.

🄬 The Aha! (One Sentence): FINCH tests AI agents on end-to-end, real-company finance workflows—spanning messy spreadsheets, emails, PDFs, formulas, charts, and multi-step tasks—and judges them like professionals would.

Three Analogies:

  1. Escape Room: The clues (emails, PDFs) and locks (formulas, references) are scattered; success requires finding, connecting, and executing steps in order.
  2. Lego Blueprint: Pieces (spreadsheets, charts) are mixed across boxes; you must read the manual (instructions), find parts across sets, and assemble correctly.
  3. Kitchen Dinner Service: Multiple dishes (tasks) must be prepped in parallel, plated in order, and delivered on time; burning one pan ruins the meal.

Before vs. After:

  • Before: Benchmarks were neat, single-file, single-step puzzles; AIs looked sharp but broke down in real offices.
  • After: FINCH shows true office complexity. Results drop below 40% pass—even for top models—pinpointing where agents fail and where to improve (retrieval, formulas, structure-aware edits, multimodal reading).

Why It Works (Intuition):

  • Real Sourcing: Using authentic emails, spreadsheets, PDFs, and version histories preserves the very messiness that causes real errors.
  • Composite Workflows: Interleaving tasks exposes error accumulation; a tiny off-by-one range can poison later charts and summaries.
  • Flexible Judging: Human and LLM judges care about correctness, completeness, layout, formulas, and avoiding unnecessary edits—like real reviewers do.

Building Blocks (with Sandwich explanations for the key new concepts):

šŸž Hook: You know how a class project needs a plan—from finding sources to writing and presenting. 🄬 Workflow Construction Process: It’s the step-by-step way FINCH turns real artifacts into clear, testable tasks.

  • How it works: 1) Find real emails and versioned files, 2) Infer the underlying work steps, 3) Write clean instructions, inputs, and reference answers, 4) Tag task types and business areas.
  • Why it matters: Without a solid process, tasks would be vague, unfair, or not reproducible. šŸž Anchor: An email says ā€œrevise 2002 allocationsā€; FINCH pairs the original and corrected spreadsheets and writes the precise instruction.

šŸž Hook: Imagine asking a super-smart helper to skim a giant binder to find the parts you need. 🄬 LLM-Assisted Discovery: Use a large language model to propose likely workflows from emails and version histories.

  • How it works: 1) Scan threads for goals and attachments, 2) Summarize intent, 3) Spot differences between file versions, 4) Draft candidate workflows.
  • Why it matters: It speeds up finding real tasks hidden in mountains of enterprise data. šŸž Anchor: The LLM notices an email chain that mentions ā€œcorrecting headcountsā€ and flags those files for human review.

šŸž Hook: Like teachers proofreading essays to make sure instructions are followed exactly. 🄬 Expert Annotation: Human experts verify, clean, and finalize each workflow.

  • How it works: 1) Check that inputs and references match the instruction, 2) Fix stray edits, 3) Ensure realistic scope, 4) Approve final task packs.
  • Why it matters: Without experts, tiny hidden mistakes or extra changes would make evaluations unfair. šŸž Anchor: An expert confirms the ā€œfix totalsā€ task only changes the affected cells and nothing else.

šŸž Hook: Think of a fair judge who looks at both the answer and how you got there. 🄬 Automated Evaluation Pipeline: A program that grades outputs using diffs, screenshots, and formula checks, aligned with human rubrics.

  • How it works: 1) Compute differences from input to reference and input to model, 2) Build compact snapshots, 3) Capture screenshots for layout, 4) Ask a judge LLM for pass/fail with reasons.
  • Why it matters: It automates evaluation at scale while catching layout and formula issues that humans might miss (and vice versa). šŸž Anchor: The judge flags if formulas were replaced by static numbers, even if the values look right (a small illustrative check follows this block).
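To make that formula check concrete, here is a minimal sketch, assuming openpyxl and hypothetical file names, of how a judge-side script could flag cells where the reference output keeps a formula but the model output wrote a static constant. It illustrates the idea only; it is not FINCH's actual pipeline.

```python
# Minimal illustrative check (not the paper's pipeline): flag cells where the
# reference output keeps a formula but the model output wrote a static constant.
from openpyxl import load_workbook  # file names below are hypothetical

def find_formula_to_constant_swaps(reference_path: str, model_path: str):
    ref_wb = load_workbook(reference_path)          # default load keeps formulas
    out_wb = load_workbook(model_path)
    issues = []
    for sheet_name in ref_wb.sheetnames:
        if sheet_name not in out_wb.sheetnames:
            issues.append((sheet_name, None, "sheet missing in model output"))
            continue
        ref_ws, out_ws = ref_wb[sheet_name], out_wb[sheet_name]
        for row in ref_ws.iter_rows():
            for ref_cell in row:
                if ref_cell.data_type == "f":       # reference cell holds a formula
                    out_cell = out_ws[ref_cell.coordinate]
                    if out_cell.data_type != "f":   # model replaced it with a constant
                        issues.append((sheet_name, ref_cell.coordinate,
                                       f"formula {ref_cell.value!r} replaced by {out_cell.value!r}"))
    return issues

# Example with hypothetical paths:
# for sheet, coord, msg in find_formula_to_constant_swaps("reference.xlsx", "model_output.xlsx"):
#     print(sheet, coord, msg)
```

A real judge would combine signals like this with diffs, compact snapshots, and screenshots before deciding pass/fail.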

šŸž Hook: A scrapbook mixes photos, captions, charts, and stickers. 🄬 Multimodal Artifacts: FINCH includes spreadsheets, PDFs, images, charts, and text.

  • How it works: 1) Combine multiple file types, 2) Keep layout context, 3) Require cross-artifact reasoning.
  • Why it matters: Real finance work is not just tables; ignoring any of these modalities loses crucial meaning. šŸž Anchor: A PDF chart of debt levels must be summarized into a spreadsheet and graphed correctly (a toy import sketch follows this block).
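As a rough illustration of cross-artifact work, the sketch below copies the first table found in a PDF into a new worksheet. It assumes the pdfplumber and pandas libraries and made-up file names; it is not part of FINCH itself.

```python
# Minimal sketch (not from the paper): copy the first table found in a PDF
# into a new Excel sheet. File names are hypothetical.
import pdfplumber
import pandas as pd

def pdf_table_to_excel(pdf_path: str, xlsx_path: str, sheet_name: str = "Imported") -> None:
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            table = page.extract_table()        # rows as lists of strings, or None
            if table:
                header, *rows = table
                df = pd.DataFrame(rows, columns=header)
                df.to_excel(xlsx_path, sheet_name=sheet_name, index=False)
                return
    raise ValueError(f"No table found in {pdf_path}")

# pdf_table_to_excel("school_trip_fees.pdf", "budget_import.xlsx")  # hypothetical files
```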

šŸž Hook: Planning a fair booth takes setup, pricing, signs, change-making, and end-of-day tallies. 🄬 Composite Workflows: Each FINCH item weaves multiple tasks into one job.

  • How it works: 1) Import/structure, 2) Retrieve across sheets/files, 3) Calculate/model, 4) Visualize/report.
  • Why it matters: Single-task tests miss the real difficulty: errors stack across steps. šŸž Anchor: A task imports sales from PDFs, fixes layouts, computes net value, and builds the final report.

Bottom Line: FINCH’s key idea is simple but powerful—evaluate AI the way real offices actually work, and you’ll quickly see what needs to improve for trustworthy, day-to-day finance automation.

03 Methodology

At a high level: Real artifacts (emails, spreadsheets, PDFs) → [Discover workflows] → [Expert finalize: instructions + inputs + reference outputs] → [Tag tasks/business types] → [Evaluate: humans + automated judge] → Output pass/fail and insights.

Step A: Gather and Discover from Real Enterprise Data

  • What happens: FINCH pulls from the Enron email corpus (500k messages with attachments), 15k spreadsheets, and EUSES financial sheets, plus material from public organizations (e.g., the World Bank) and firms. An LLM scans emails to find clear goals tied to spreadsheets (e.g., ā€œrevise 2002 allocations,ā€ ā€œupdate RAC rankingsā€) and analyzes version histories to infer changes (e.g., assumption updates, error fixes).
  • Why this step exists: Real work lives in threads and versions; this preserves authentic messiness (cross-references, inconsistent formatting, formula logic).
  • Example: An email thread attaches initial and corrected ranking files. The LLM proposes a workflow: ā€œFix mis-sorted departments and recompute totals.ā€ (A simple discovery-prompt sketch appears after this step.)
  • What breaks without it: We’d get toy problems that don’t reflect genuine enterprise complexity.
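To make the discovery step more tangible, here is a small sketch of how a workflow-discovery prompt could be assembled from an email thread and a version-history diff before being sent to an LLM. The wording and fields are assumptions for illustration; the paper does not publish its exact prompts.

```python
# Illustrative only: build a workflow-discovery prompt from an email thread and
# a list of cells that changed between two file versions. The prompt text and
# structure are assumptions, not FINCH's actual prompts.
from typing import Dict, List

def build_discovery_prompt(email_thread: str, changed_cells: List[Dict[str, str]]) -> str:
    changes = "\n".join(
        f"- {c['sheet']}!{c['cell']}: {c['before']!r} -> {c['after']!r}" for c in changed_cells
    )
    return (
        "You are helping identify finance workflows hidden in enterprise data.\n"
        "Given the email thread and the edits between two versions of an attached\n"
        "spreadsheet, propose a candidate workflow: goal, required steps, and the\n"
        "files involved. Flag it for human review if the intent is unclear.\n\n"
        f"EMAIL THREAD:\n{email_thread}\n\nVERSION DIFF:\n{changes}\n"
    )

# Example with made-up data:
prompt = build_discovery_prompt(
    "Subject: RAC rankings\nPlease correct the mis-sorted departments and resend.",
    [{"sheet": "Rankings", "cell": "B7", "before": "Dept C", "after": "Dept A"}],
)
```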

Step B: Expert Annotation and Normalization

  • What happens: Experts rewrite conversational requests into precise instructions; they verify that the input and reference files align exactly with those instructions, trimming any unrelated changes. For partially grounded cases (missing attachments), they reconstruct the minimal inputs/outputs needed.
  • Why this step exists: Ensures fairness, realism, and reproducibility—critical for trustworthy benchmarking.
  • Example: ā€œUpdate 2002 allocationsā€ becomes a clear task: ā€œIn Allocations.xlsx, Sheet ā€˜2002’, correct the department splits per the attached schedule; do not modify other years.ā€
  • What breaks without it: Hidden, out-of-scope edits or vague directions would lead to confusing and unfair grading.

Step C: Unify into a Common Schema and Tagging

  • What happens: Each workflow is packaged with: natural-language instruction, input files, reference outputs, and tags for task types (e.g., data entry, structuring, calculation, validation, cross-file retrieval, modeling, visualization, translation, web search) and business types (e.g., reporting, trading and risk, planning/budgeting, pricing/valuation, operations).
  • Why this step exists: Consistency enables large-scale evaluation and comparisons across tasks and domains.
  • Example: A modeling task from an investment spreadsheet is tagged as ā€œfinancial modeling + visualizationā€ within ā€œasset management.ā€
  • What breaks without it: Results would be incomparable, and research could not target weaknesses by category. (An illustrative record layout is sketched below.)
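For illustration, a packaged workflow under such a schema might look roughly like the sketch below; the field names and values are assumptions, not the benchmark's published format.

```python
# A minimal sketch of how one FINCH workflow package could be represented.
# Field names here are illustrative assumptions, not the benchmark's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class WorkflowItem:
    workflow_id: str                        # e.g. "enron-allocations-2002" (hypothetical)
    instruction: str                        # natural-language task description
    input_files: List[str]                  # spreadsheets, PDFs, images, docs
    reference_outputs: List[str]            # gold-standard result files
    task_types: List[str] = field(default_factory=list)      # e.g. ["calculation", "visualization"]
    business_types: List[str] = field(default_factory=list)  # e.g. ["planning/budgeting"]

item = WorkflowItem(
    workflow_id="example-001",
    instruction="In Allocations.xlsx, sheet '2002', correct the department splits; do not modify other years.",
    input_files=["Allocations.xlsx", "schedule.pdf"],
    reference_outputs=["Allocations_reference.xlsx"],
    task_types=["data entry", "calculation"],
    business_types=["planning/budgeting"],
)
```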

Step D: Human Evaluation (Gold Standard)

  • What happens: Trained annotators open the instruction, input, reference, and model outputs side by side and judge pass/fail. They check completeness (did it do all steps?), correctness (numbers/formulas/layout), and over-edit avoidance (no unnecessary changes), with flexibility for multiple valid solutions.
  • Why this step exists: Real finance has many acceptable ways to format, calculate, or summarize; human judgment respects this nuance.
  • Example: Two formulas that compute the same value differently are both okay if they are correct and aligned with instructions.
  • What breaks without it: Strict, exact matches would unfairly fail correct but differently formatted solutions.

Step E: Automated LLM-as-Judge Pipeline (Scale with Care)

  • What happens: For modification tasks, the pipeline computes diffs from input→reference and input→model, then builds compact snapshots (keep first/last rows/cols and all edited regions) and takes screenshots to capture merged cells, charts, and layout. For generation tasks, it extracts all values and formulas and screenshots every sheet. The judge LLM applies the same rubric and outputs pass/fail with a short reason.
  • Why this step exists: It scales evaluation and often catches subtle spreadsheet issues (e.g., formulas replaced by constants) that humans might miss in a GUI.
  • Example: The judge flags that a chart’s series swapped axes compared to the instruction, even though the numbers look similar.
  • What breaks without it: Human-only grading would be too slow, and purely automatic string comparisons would miss layout, formula semantics, and acceptable alternatives. (A simplified cell-diff sketch follows this step.)
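As a simplified picture of the diff step, the sketch below collects cell-level edits between two workbooks with openpyxl. The real pipeline additionally builds compact snapshots and screenshots, which are omitted here; the paths and output format are assumptions.

```python
# Illustrative only: collect cell-level edits between an input workbook and an
# edited workbook, the kind of diff a judge pipeline could be fed.
from openpyxl import load_workbook

def diff_workbooks(input_path: str, edited_path: str):
    before = load_workbook(input_path)      # formulas preserved by default
    after = load_workbook(edited_path)
    edits = []
    for name in set(before.sheetnames) | set(after.sheetnames):
        if name not in before.sheetnames or name not in after.sheetnames:
            edits.append({"sheet": name,
                          "change": "sheet added" if name not in before.sheetnames else "sheet removed"})
            continue
        ws_before, ws_after = before[name], after[name]
        max_row = max(ws_before.max_row, ws_after.max_row)
        max_col = max(ws_before.max_column, ws_after.max_column)
        for r in range(1, max_row + 1):
            for c in range(1, max_col + 1):
                old = ws_before.cell(row=r, column=c).value
                new = ws_after.cell(row=r, column=c).value
                if old != new:
                    edits.append({"sheet": name,
                                  "cell": ws_after.cell(row=r, column=c).coordinate,
                                  "before": old, "after": new})
    return edits
```

Running this once for input→reference and once for input→model output gives the judge two edit sets to compare.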

Step F: Agent Evaluation Setups

  • Product-side agents: ChatGPT GPT 5.1 Pro and Claude Sonnet 4.5 in their native environments (web browsing enabled, independent runs). These agents iterate: they inspect sheets, run tools, and refine outputs.
  • API-based models: GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro Preview, Grok 4, and Qwen 3 Max, run in a single-call, code-as-action paradigm: the model must emit a complete Python script (e.g., using openpyxl/pandas/matplotlib) to produce the output. Inputs include a rich spreadsheet encoding (cell addresses, types, formulas) and multimodal content (native PDFs for some models, screenshots/text extractions for others). Context management truncates oversized inputs while warning the model. (A toy version of such an encoding is sketched after this list.)
  • Why this step exists: Compares interactive, tool-rich agents with single-call code-generation agents to reveal which abilities (iteration, feedback, multimodality) matter most.
  • Example: An API model generates a script to pull values across sheets and recompute net value, then saves a new workbook with charts.
  • What breaks without it: We wouldn’t know if performance gaps come from reasoning ability versus missing interaction loops or tool access.
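The sketch below shows what a structure-aware, text-based encoding of a single sheet could look like, with cell addresses, value types, formulas, and crude truncation. The format is an assumption for illustration, not FINCH's actual encoding.

```python
# A toy, structure-aware spreadsheet encoding (illustrative, not FINCH's format):
# each non-empty cell is rendered with its address, value type, and value/formula
# so a model can see the logic, not just the numbers.
from openpyxl import load_workbook

def encode_sheet(path: str, sheet_name: str, max_cells: int = 2000) -> str:
    ws = load_workbook(path)[sheet_name]
    lines = [f"# sheet: {sheet_name} ({ws.max_row} rows x {ws.max_column} cols)"]
    count = 0
    for row in ws.iter_rows():
        for cell in row:
            if cell.value is None:
                continue
            kind = "formula" if cell.data_type == "f" else type(cell.value).__name__
            lines.append(f"{cell.coordinate}\t{kind}\t{cell.value!r}")
            count += 1
            if count >= max_cells:              # crude context management: truncate with a warning
                lines.append(f"# truncated after {max_cells} cells")
                return "\n".join(lines)
    return "\n".join(lines)

# print(encode_sheet("Allocations.xlsx", "2002"))   # hypothetical file and sheet names
```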

Step G: Quality Control and Statistics

  • What happens: Inter-annotator checks, product-side agents used as secondary validators, and LLM-based flags for possible alignment issues. The dataset ends up with 172 workflows, 1,710 spreadsheets (27M cells), plus PDFs/images/docs, covering calculation, structuring, data entry/import, validation, cross-file retrieval, summarization/visualization, modeling, web search, and translation.
  • Why this step exists: Maintains dataset integrity under high complexity.
  • Example: If a reference workbook contains a stray change outside the instruction, it’s corrected before release.
  • What breaks without it: Leaky or inconsistent references would make scores meaningless.

The Secret Sauce:

  • Ground-truth messiness: Email threads and version histories preserve authentic goals and edits.
  • Structure-aware encoding: Cell addresses, types, and formulas keep the logic visible to models.
  • Judge that ā€œseesā€: Diffs plus screenshots let the evaluator reason about both numbers and layout.
  • Composite design: Multiple interdependent steps expose error accumulation, the real office challenge.

In short, FINCH is a careful recipe: find realistic tasks, clarify them, keep their messy details, and then grade in a way that respects real office standards.

04 Experiments & Results

The Test: FINCH measures whether AI agents can complete whole finance workflows—not just answer a question. It checks: did the agent finish all required steps, get numbers and formulas right, avoid touching unrelated parts, and present readable results? Time per workflow and breakdowns by task type and number of steps show where models struggle.

The Competition: Two product-side agents (ChatGPT GPT 5.1 Pro, Claude Sonnet 4.5) and five API-based models (GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro Preview, Grok 4, Qwen 3 Max) were tested. Product agents can iterate and use tools; API agents must solve the task in one code-generation shot.

The Scoreboard (with context):

  • GPT 5.1 Pro (product agent): 38.4% pass by human evaluation, spending about 16.8 minutes per workflow. Think of this like getting a solid C+ on a very tough, realistic exam where many small traps exist.
  • Claude Sonnet 4.5 (product agent): 25.0% pass. More like a D+ on the same exam—competent in parts, but tripped by multi-step composition and structure-heavy tasks.
  • Automated judging (LLM-as-judge) is slightly more generous (GPT 5.1 Pro at 41.9%, Claude at 29.1%) and agrees with humans 82–90% of the time, with good recall and decent precision. That’s like a video replay that usually matches the referee but occasionally misses a foot fault or calls one too strictly.

By Number of Tasks (compositionality bites hard):

  • GPT 5.1 Pro’s pass rate drops from about 44.3% (≤2 tasks) to 23.5% (>2 tasks). Longer chains mean more chances to drop the baton.
  • Average time rises with steps (about 13–21 minutes), except some web-heavy 5-step workflows fail quickly, suggesting premature exits when web search encounters ambiguity.

By Task Type (messiness and structure are hard):

  • Data Entry/Import and Structuring/Formatting are among the toughest. Irregular layouts, merged cells, and nonstandard schemas defeat naive code. When imports involve web search or PDFs, the multimodal hurdles multiply.
  • Translation surprisingly underperforms compared to general NLP settings because preserving layout, header hierarchies, and alignment across large grids is hard; omissions or misalignments cause failure.
  • Cross-sheet/file retrieval and formula reasoning cause many errors; a single off-by-one range or misunderstood formula propagates wrong answers.

API vs. Product Agents:

  • API baselines (single-shot code) underperform product agents, largely because they can’t iterate, inspect, and fix. Still, rich encodings and careful prompts narrow the gap somewhat (e.g., GPT 5.1 via API around 32.0%).
  • This shows iteration with feedback and tool use are key ingredients for success in enterprise-grade spreadsheet work.

Surprising Findings:

  • Even with modern strengths in translation and reasoning, AIs falter when translations must preserve exact spreadsheet structure at scale.
  • The automated judge occasionally catches subtle spreadsheet issues that humans miss (e.g., formulas silently replaced by constants), but can also be overly literal in edge cases; hence human review remains crucial.

Takeaway: On this real-world, long-horizon stage, today’s best AIs pass fewer than 4 in 10 workflows. The big enemies are composition, structure, formulas, and multimodal grounding—exactly what offices rely on daily.

05 Discussion & Limitations

Limitations:

  • Coverage vs. size: FINCH has 172 workflows—broad and realistic—but still a finite slice of enterprise life. It may not cover every industry nuance, rare file formats, or edge-case spreadsheet features.
  • Judging ambiguity: While humans and the LLM judge use a clear rubric and allow multiple valid solutions, borderline cases remain. Automated scores can drift a few points from human gold.
  • Tooling constraints: Some corrupted or exotic files are hard to parse programmatically; a valid human-readable workbook might fail automatic processing.
  • Single-shot API setting: The API baselines lack iterative retries and execution feedback, so their scores reflect raw one-shot coding, not full agentic potential.

Required Resources:

  • Computing: Handling large, formula-heavy, multi-sheet workbooks plus PDFs and images needs memory and time (product agents average ~17 minutes per workflow).
  • Tooling: Spreadsheet libraries (openpyxl/pandas), PDF parsers, screenshot renderers, and robust diffing are essential.
  • Expertise: Domain-aware annotators and evaluators keep instructions precise and references clean.

When NOT to Use:

  • If your goal is pure NLP without artifacts (no spreadsheets, no PDFs), FINCH is overkill.
  • If you need regulated, private data only, FINCH’s public sources won’t capture proprietary systems or policies.
  • If you require instant micro-latency decisions, these long workflows and judgments won’t match that constraint.

Open Questions:

  • How to build agents that reflect, inspect, and repair mid-flight (e.g., verify ranges, sanity-check formulas) before errors snowball?
  • Can models read and reason over formulas as first-class citizens, not just values, to recover hidden business logic?
  • What training curricula best teach structure-aware editing, over-edit avoidance, and cross-file retrieval at scale?
  • How to robustly integrate multimodal signals (PDF tables, charts, images) while preserving layout semantics?
  • Can evaluation better distinguish harmless differences (equivalent formulas/layouts) from risky mistakes (broken links, lost formulas)?

06 Conclusion & Future Work

Three-Sentence Summary: FINCH is a finance-and-accounting benchmark built from real enterprise artifacts—emails, spreadsheets, PDFs, charts—that tests AI agents on multi-step, spreadsheet-centric workflows with human- and LLM-based judging. On this realistic stage, even frontier systems pass fewer than 40% of workflows, revealing hard problems in cross-file retrieval, formula reasoning, structure-aware editing, and multimodal grounding. FINCH provides a rigorous foundation for improving office-ready AI.

Main Achievement: Turning genuine enterprise messiness into well-annotated, end-to-end workflows with fair judging—so models are evaluated the way professionals actually work.

Future Directions: Develop agents that verify and repair mid-task, read formulas as logic, plan across steps with memory, and ground decisions across spreadsheets, PDFs, and charts. Improve encodings, tool APIs, and training data that emphasize structure and over-edit avoidance. Refine automated judges to align even more tightly with expert standards.

Why Remember This: If you want AI to help with real office spreadsheets—not toy tables—FINCH is the stress test that matters. It shows exactly where current systems stumble and points the shortest path to trustworthy, everyday finance automation.

Practical Applications

  • Evaluate enterprise AI co-pilots on realistic finance workflows before deployment.
  • Stress-test spreadsheet agents for cross-sheet retrieval, formula safety, and over-edit avoidance.
  • Train agents with curriculum derived from FINCH weak spots (e.g., structure-aware edits, multimodal imports).
  • Benchmark improvements to spreadsheet encodings that preserve cell addresses, types, and formulas.
  • Prototype interactive agents that verify ranges and formulas mid-task and self-correct.
  • Use the LLM-as-judge pipeline as a QA harness for internal spreadsheet automation.
  • Design onboarding exercises for analysts to practice robust, multi-step spreadsheet work.
  • Audit critical financial models by replaying workflows and checking for silent formula-to-constant swaps.
  • Compare product-side agents to API-based code generators to pick the right architecture.
  • Create teaching modules for accounting/finance students on real-world spreadsheet workflows.
Tags: FINCH benchmark Ā· finance and accounting AI Ā· spreadsheet agents Ā· multimodal workflows Ā· LLM-as-judge Ā· formula reasoning Ā· cross-file retrieval Ā· enterprise evaluation Ā· composite workflows Ā· spreadsheet encoding Ā· version history diff Ā· data entry and structuring Ā· visualization and reporting Ā· web search integration Ā· over-edit avoidance