DocDancer: Towards Agentic Document-Grounded Information Seeking
Key Summary
- DocDancer is a smart document helper that answers questions by exploring and reading long, mixed-media PDFs using just two tools: Search and Read.
- The paper turns Document Question Answering (DocQA) into an information-seeking mission, where the agent plans, looks for clues, and double-checks facts.
- It fixes the lack of training data by first exploring documents to collect grounded evidence, and then synthesizing hard, multi-step question–answer pairs.
- DocDancer is trained end-to-end on only 5,000 synthetic examples yet performs competitively with much larger or closed models.
- On two tough benchmarks (MMLongBench-Doc and DocBench), it beats many OCR, RAG, and prompt-based agent systems.
- The agent uses a simple, tool-light design (just Search and Read), but learns to chain them iteratively for complex, cross-page reasoning.
- A stronger document outline (via a better parser) and the two-tool design both significantly boost accuracy, especially on multi-page questions.
- Synthetic data made with the Exploration-then-Synthesis pipeline outperforms training on available human-written QA in the same PDFs.
- Even small open-source backbones (4B and 30B) learn solid agentic behaviors after fine-tuning on the synthetic dataset.
- The work offers insights on agent tool design, document parsing quality, and how to build robust DocQA agents with minimal tools and data.
Why This Research Matters
Long PDFs are everywhere—reports, contracts, manuals, textbooks—and people waste time hunting for scattered facts. DocDancer shows a simple, learnable way to find, combine, and verify information across text, tables, charts, and pages. Because it trains end-to-end, it develops dependable habits for when to search, when to read deeply, and how to synthesize results. Its synthetic training data is practical to produce and outperforms available human-written QA in the same sources. With only two tools, teams can deploy document agents that are easier to maintain and explain. This can speed up work in finance, law, research, customer support, and education, leading to fewer mistakes and faster, grounded answers.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you have a huge scrapbook with words, pictures, charts, and tables spread across dozens of pages. If someone asks you a detailed question—like a ratio that needs numbers from different pages—you wouldn’t just read the first page and guess. You’d flip around, search, and piece things together.
🥬 The Concept: Document Question Answering (DocQA) is when AI answers questions using the content inside documents. How it works:
- You give the AI a question and a document.
- The AI finds the right parts (text, images, tables).
- It reasons over them to build an answer.
Why it matters: Without DocQA, AI either guesses or requires a human to extract the right parts first, which fails on long, complex PDFs.
🍞 Anchor: If you ask, “What was the 2015 advertising expense to sales ratio?” the AI must find the expense number in one section and the sales in a different table, then compute the ratio.
🍞 Hook: You know how big chapter books are hard to skim? Now imagine a book made of text plus charts and photos—way trickier.
🥬 The Concept: Long, multi-modal documents mix text with visuals (tables, charts, images) across many pages. How it works:
- Text and layout are connected (headings, captions, footnotes).
- Useful facts can hide in tables or charts.
- Answers often require hopping across pages and formats.
Why it matters: A single pass misses cross-page clues; complex structure confuses systems that expect plain text.
🍞 Anchor: A financial report might state a claim in text but give exact numbers in a distant table.
🍞 Hook: Think of typing all the book pages into a note app and searching by keywords. Fast, yes—but do you miss charts?
🥬 The Concept: OCR pipelines turn pages into text for keyword search. How it works:
- Convert page images to text.
- Search tokens to retrieve matches.
- Answer using the found text.
Why it matters: Without visuals and layout, you miss charts, tables, and their context; errors snowball on long docs.
🍞 Anchor: If the expense is only in a table image, OCR-only search might never see it.
🍞 Hook: Imagine asking a librarian for help. They grab likely pages and let you read them—convenient, but they might pick the wrong ones.
🥬 The Concept: Retrieval-Augmented Generation (RAG) fetches relevant chunks and feeds them to a language model. How it works:
- Index the doc.
- Retrieve top-k chunks for a query.
- Generate an answer from these chunks.
Why it matters: If retrieval misses one key chunk, the answer is wrong; one-shot retrieval struggles on multi-step, cross-page tasks.
🍞 Anchor: If revenue is on one page and advertising on another, but RAG retrieves only one, the ratio is impossible to compute.
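To make the one-shot limitation concrete, here is a minimal sketch of retrieve-then-generate in Python. It uses a crude keyword-overlap scorer in place of a real embedding model, and the chunk texts are invented for illustration; it is not any specific RAG library's API.

```python
# Minimal one-shot RAG sketch; keyword overlap stands in for vector similarity,
# and the chunks are invented examples.
from collections import Counter

def score(query: str, chunk: str) -> float:
    """Crude lexical overlap used as a stand-in for embedding similarity."""
    q, c = Counter(query.lower().split()), Counter(chunk.lower().split())
    return sum((q & c).values())

def retrieve_top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]

chunks = [
    "Marketing discussion: advertising expenses were $714.3M in 2015.",
    "Consolidated Statements of Operations: revenues of $6,779.511M in 2015.",
    "Letter to shareholders: we grew our member base substantially.",
]
context = retrieve_top_k("2015 advertising expense to sales ratio", chunks, k=1)
# With k=1 only one of the two needed chunks is retrieved, so the ratio
# cannot be computed -- exactly the failure mode described above.
print(context)
```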
🍞 Hook: When you solve a scavenger hunt, you take steps: look around, read a clue, adjust your plan, and keep going.
🥬 The Concept: Agent-based DocQA means the model plans, takes actions with tools, observes results, and iterates. How it works:
- Plan what to look for first.
- Use tools (like Search) and observe.
- Update the plan and dig deeper (like Read sections).
- Stop when enough evidence is gathered and synthesize an answer.
Why it matters: Without iterative actions, the model can’t adapt when the first try misses something.
🍞 Anchor: The agent searches for “advertising,” reads the section to confirm the number, then searches and reads for “revenue,” then computes the ratio.
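Here is a minimal sketch of that plan → act → observe loop in Python. The `search` and `read` stubs and their return values are hypothetical stand-ins for the paper's Search and Read tools, not the authors' implementation.

```python
# Minimal plan-act-observe sketch; `search` and `read` are hypothetical stubs
# standing in for the paper's Search and Read tools.

def search(keywords: str) -> list[dict]:
    """Pretend global search: return candidate sections matching the keywords."""
    return [{"section_id": "s3", "page": 12, "snippet": "advertising expenses ... $714.3M"}]

def read(section_id: str, goal: str) -> str:
    """Pretend local read: return goal-relevant content from one section."""
    return "In 2015, advertising expenses were $714.3 million."

evidence = []
plan = ["2015 advertising expense", "2015 total revenues"]   # plan
for goal in plan:
    hits = search(goal)                                      # act: global discovery
    for hit in hits:                                         # observe promising sections
        evidence.append(read(hit["section_id"], goal))       # act: local, deeper read
# With enough evidence gathered, the agent synthesizes the final answer (e.g., a ratio).
print(evidence)
```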
🍞 Hook: Training a helper to do the right steps is easier if it watches lots of good examples.
🥬 The Concept: End-to-end training teaches the agent to plan and act across the whole process, not just final answers. How it works:
- Show the model full trajectories (thoughts, tool calls, observations, answers).
- Optimize it to replicate successful behavior.
- Mask observation tokens so it learns decisions rather than copying tool output.
Why it matters: Without end-to-end learning, agents rely on brittle prompts and don’t develop robust habits.
🍞 Anchor: The model learns when to Search first versus Read next based on prior successful traces.
🍞 Hook: If a teacher wants to train you on tricky questions but there aren’t enough in the book, they can write new, quality problems.
🥬 The Concept: Synthetic data generation creates new, grounded Q&A from documents so agents can practice. How it works:
- Explore documents to collect evidence.
- Synthesize questions that require multi-step reasoning.
- Check and filter for quality.
Why it matters: Without good training pairs, agents don’t learn reliable exploration or synthesis.
🍞 Anchor: Create a question that needs numbers from a chart and a table on different pages, plus a short calculation.
The gap this paper fills: Earlier systems used OCR or single-shot retrieval, or depended on closed, prompt-heavy agents. High-quality training data for long, visual documents was scarce. DocDancer introduces a simple, powerful two-tool agent, trained end-to-end on synthetic, document-grounded trajectories.
Real stakes: People read long PDFs every day—reports, contracts, manuals. A reliable AI reader saves time, reduces mistakes, and unlocks insights that are hard to piece together by hand.
02 Core Idea
🍞 Hook: You know how a good treasure hunter carries just a map and a magnifying glass—and still finds everything? Fewer tools, used smartly, beat a messy toolkit.
🥬 The Concept: Key insight in one sentence: Train an agent to seek information inside documents using just two complementary tools—Search (global clues) and Read (local understanding)—and give it synthetic, grounded practice to master the dance from exploration to answer synthesis. How it works (intuition first):
- Treat every question as a mini-expedition (plan → act → observe → adjust).
- Use Search to quickly scan the whole doc for promising spots.
- Use Read to deeply understand selected sections, including text, tables, images, and page layout.
- Chain steps until enough evidence is gathered, then synthesize a precise, grounded answer.
Why it matters: Over-complicated toolkits can confuse models; a minimal, well-designed pair fosters reliable strategies that generalize.
🍞 Anchor: For “2015 ad expense to sales ratio,” Search finds where “advertising expenses” is discussed; Read confirms the exact $714.3M expense and the $6,779.511M revenue; the agent then computes ≈ 0.105.
Three analogies for the same idea:
- Detective: Search is scanning the room; Read is putting a fingerprint under the microscope; verdict is the answer.
- Librarian: Search locates the right shelves; Read skims chapters, figures, and tables to collect facts; the report is the final synthesis.
- Map + Magnifying Glass: Search is your map for where to go; Read is the magnifying glass to examine details; the route summary is the answer.
🍞 Hook: Before GPS, you’d ask, look around, then walk closer; after GPS, you still look carefully once you arrive.
🥬 The Concept: Before vs After this idea. How it works:
- Before: OCR or one-shot retrieval misses visuals or cross-page dependencies; prompt-based agents lack learned habits.
- After: A trained agent reliably sequences Search→Read→reasoning and excels at multi-hop, cross-modal evidence use.
- Before: Data scarcity blocks training; After: Exploration-then-Synthesis manufactures high-quality, grounded practice.
Why it matters: It moves from brittle scripts to learned, adaptive strategies that scale to long, messy PDFs.
🍞 Anchor: In domain-wise tests (like Finance or Reports), the trained agent handles tables and charts better than text-only or single-shot methods.
🍞 Hook: Imagine practicing piano: you first explore the sheet music, then perform the piece. Practice pieces should be challenging but fair.
🥬 The Concept: Exploration-then-Synthesis is a data pipeline that first gathers document evidence through tool-augmented exploration, then synthesizes multi-step, visual-plus-text questions. How it works:
- Exploration: Iteratively Search and Read with explicit intents to collect grounded snippets, tables, figures, and layouts.
- Synthesis: Build questions that require combining multiple observations, often with light computation.
- Filtering: Use a strong model to keep only high-quality, grounded Q&A.
Why it matters: Without exploration-guided evidence, synthetic Q&A becomes shallow; with it, agents learn real document reasoning.
🍞 Anchor: A generated question might ask to verify a textual claim using a chart and a table across different pages, ensuring multi-hop reasoning.
🍞 Hook: A great coach doesn’t just tell you the answer; they help you learn the steps to get it next time.
🥬 The Concept: End-to-end training with observation masking teaches the agent to make good decisions, not to parrot tool outputs. How it works:
- Train on trajectories of thoughts, actions, observations, answers.
- Mask loss on observation tokens so the model focuses on choosing actions and forming correct reasoning.
- The result is a policy that knows when to Search, when to Read, and when to stop.
Why it matters: Without this, the agent overfits to seen outputs and struggles to generalize its decision-making.
🍞 Anchor: The model learns patterns like “If Search returns multiple hits for ‘advertising,’ Read the one mentioning expenses by year.”
Building blocks:
- A robust outline built from high-precision parsing (so sections, captions, and visuals are well-indexed).
- A minimal, complementary toolset (Search for global, Read for local multimodal comprehension).
- An exploration policy (plan–act–observe–revise) guided by intents.
- A synthesis step that fuses observations into concise, grounded answers.
- End-to-end training on synthetic but realistic trajectories, with observation-masked loss to emphasize choices over echoes.
03 Methodology
High-level recipe: Input (long, multi-modal PDF + question) → Document Processing (build a reliable outline) → Agent Loop (plan → Search → Read → revise) → Evidence Fusion (reason over text, tables, images) → Answer.
🍞 Hook: Imagine reorganizing a messy binder so you can find what you need fast.
🥬 The Concept: Document processing builds a clean outline so the agent can navigate text, tables, figures, and pages. How it works:
- Run a strong parser (e.g., MinerU2.5) to detect 17 element types (paragraphs, images, tables, etc.).
- Infer a hierarchy (sections/subsections) from visual cues like titles and layout.
- Attach captions and visual descriptions so images/charts are searchable.
Why it matters: If the outline is sloppy, Search returns noisy hits and Read misses key visuals.
🍞 Anchor: When the agent searches for “advertising,” it can jump straight to the paragraph with the exact yearly numbers.
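A sketch of what one node of such an outline might contain, written as a Python dataclass. The field names are illustrative assumptions, not MinerU2.5's actual output schema.

```python
# Illustrative outline node; field names are assumptions, not the parser's real schema.
from dataclasses import dataclass, field

@dataclass
class OutlineNode:
    section_id: str                     # stable ID the Search/Read tools can reference
    title: str                          # inferred from visual cues like headings
    pages: list[int]                    # where the section's content lives
    element_types: list[str]            # e.g. "paragraph", "table", "image"
    caption_text: str = ""              # captions/descriptions make visuals searchable
    children: list["OutlineNode"] = field(default_factory=list)

marketing = OutlineNode("s3", "Marketing", [12], ["paragraph"])
statements = OutlineNode("s7", "Consolidated Statements of Operations", [41], ["table"],
                         caption_text="Revenues and operating results by year")
```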
🍞 Hook: Solving hard puzzles means you think, try, look, think again, and continue.
🥬 The Concept: ReAct-style agent loop interleaves reasoning and actions. How it works:
- Think: Decide what to look for first (e.g., the numerator of a ratio).
- Act: Use Search with keywords to get global hits.
- Observe: See which sections/pages look promising.
- Think again: Narrow the goal.
- Act: Use Read on specific section IDs to extract detailed, multimodal content.
- Repeat until evidence is sufficient; then synthesize the answer.
Why it matters: Without looping, the agent can’t correct course when early guesses are off.
🍞 Anchor: After searching “Revenues,” the agent reads the Consolidated Statements table to confirm the exact 2015 number and its units.
🍞 Hook: Two tools can beat a drawer full of gadgets if you know exactly when and how to use them.
🥬 The Concept: Search and Read are the only tools, designed to cover global discovery and local comprehension. How it works:
- Search: Keyword-based, returns section IDs, page numbers, and short snippets for fast localization.
- Read: Given a goal and section IDs, extracts text, tables, images, and a page screenshot; a multimodal summarizer fuses them into goal-relevant content.
Why it matters: Too many tools increase confusion; these two cover breadth and depth efficiently.
🍞 Anchor: Search finds the paragraph naming “advertising expenses,” and Read pulls the precise $714.3M and the matching revenue from another section.
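Illustrative signatures for the two tools, matching the description above; the exact names, arguments, and return fields are assumptions rather than the paper's published interface.

```python
# Illustrative tool interfaces; argument and field names are assumptions.
from typing import TypedDict

class SearchHit(TypedDict):
    section_id: str
    page: int
    snippet: str

def Search(keywords: str) -> list[SearchHit]:
    """Global localization: keyword hits with section IDs, page numbers, short snippets."""
    ...

def Read(goal: str, section_ids: list[str]) -> str:
    """Local comprehension: pull text, tables, images, and a page screenshot for the given
    sections, then have a multimodal summarizer fuse them into goal-relevant content."""
    ...
```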
🍞 Hook: Good practice problems make you strong; great practice problems make you unstoppable.
🥬 The Concept: Exploration-then-Synthesis data pipeline creates high-quality training trajectories. How it works:
- Exploration: With explicit intents, the agent alternates Search and Read to collect grounded observations across pages and modalities.
- Synthesis: Another model turns the observation bundle into a natural, unambiguous, multi-step Q&A pair.
- Filtering: A strong judge model performs rejection sampling to keep only high-quality items.
Why it matters: Training on shallow Q&A won’t teach multi-hop reasoning or cross-modal grounding.
🍞 Anchor: A generated question might require reading a narrative claim, then verifying it against a chart and a table, with a short calculation.
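A sketch of the three-stage pipeline in Python. The three callables are hypothetical stand-ins for the exploring agent, the synthesis model, and the judge model; only the control flow reflects the description above.

```python
# Exploration-then-Synthesis sketch; the three callables are hypothetical stand-ins.

def explore_document(doc, intents):
    """Alternate Search and Read with explicit intents; return grounded observations."""
    raise NotImplementedError

def synthesize_qa(observations):
    """Turn an observation bundle into a natural, multi-step question-answer pair."""
    raise NotImplementedError

def judge_quality(qa_pair) -> bool:
    """Strong judge model used for rejection sampling."""
    raise NotImplementedError

def build_training_set(documents, intents):
    dataset = []
    for doc in documents:
        observations = explore_document(doc, intents)   # 1) exploration
        qa_pair = synthesize_qa(observations)           # 2) synthesis
        if judge_quality(qa_pair):                      # 3) filtering
            dataset.append(qa_pair)
    return dataset
```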
🍞 Hook: Coaches don’t want you to memorize the scoreboard; they want you to learn the plays.
🥬 The Concept: Observation-masked loss trains decisions (thoughts/actions), not the external tool text. How it works:
- Compute loss only on the agent’s thoughts and actions.
- Ignore loss on raw observations returned by tools.
- Encourage a policy that generalizes its decision-making.
Why it matters: Otherwise, the model might overfit to seen snippets rather than learn when and where to look next.
🍞 Anchor: The agent learns a reusable habit: find entity A in text, confirm entity B in a table, then combine.
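A minimal sketch of how observation masking can be implemented during supervised fine-tuning. It assumes the common convention that label positions set to -100 are ignored by the cross-entropy loss (as in Hugging Face Transformers); segment roles and the toy tokenizer are illustrative.

```python
# Observation masking sketch: tool outputs stay in the context (input_ids) but get
# label -100, so the loss only covers the agent's thoughts, actions, and final answer.

def build_labels(segments, tokenize):
    """segments: list of (role, text) pairs in trajectory order."""
    input_ids, labels = [], []
    for role, text in segments:
        ids = tokenize(text)
        input_ids.extend(ids)
        if role == "observation":              # returned by Search/Read: not trained on
            labels.extend([-100] * len(ids))
        else:                                  # thought / action / answer: trained on
            labels.extend(ids)
    return input_ids, labels

toy_tokenize = lambda text: list(text.encode("utf-8"))   # stand-in for a real tokenizer
segments = [
    ("thought", "Need the 2015 advertising expense first."),
    ("action", 'Search("advertising")'),
    ("observation", "[s3, p.12] advertising expenses ... $714.3M"),
    ("answer", "0.105"),
]
input_ids, labels = build_labels(segments, toy_tokenize)
```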
Step-by-step example with real data:
- Input: Question: “What is advertising expense to sales ratio of Netflix in FY 2015?”
- Document Processing: Build an outline that links the Marketing paragraph and the Operations table.
- Agent Loop:
  - Think: Need numerator (ad expense) and denominator (revenue).
  - Search: “advertising” → hit in a paragraph with exact yearly expenses.
  - Read: That section → confirms $714.3M (2015).
  - Search: “Revenues” → multiple hits including the Statements of Operations.
  - Read: The statements table → $6,779.511M.
  - Compute: 714.3 / 6,779.511 ≈ 0.105; round to three decimals.
- Output: 0.105, grounded by two independent sources.
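A one-line check of the arithmetic in the walkthrough (plain Python, no project code assumed):

```python
# Verify the ratio from the walkthrough: advertising expense / revenue, both in $M.
print(round(714.3 / 6779.511, 3))  # 0.105
```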
Secret sauce:
- Minimal toolset with maximal coverage (global + local, text + visuals).
- High-quality outlines that respect layout and captions.
- Synthetic training that enforces multi-page, multi-modal, multi-step reasoning.
- End-to-end learning that bakes strategy into the model rather than relying on fragile prompt chains.
04 Experiments & Results
🍞 Hook: If a team scores 87 points, is that good? It depends—if everyone else is scoring 60, that’s excellent; if they’re scoring 95, not so much.
🥬 The Concept: Benchmarks and metrics tell us how well DocDancer does compared to others. How it works:
- Benchmarks: MMLongBench-Doc (long, visual PDFs across domains) and DocBench (real-world docs and tasks).
- Metrics: Accuracy and F1 (overlap with ground truth), plus an LLM-as-Judge (LasJ) score to fairly judge varied answer formats.
- Baselines: OCR-only, VLM-only, RAG, and prompt-based agents.
Why it matters: Numbers mean something only with context—who else played, under what rules.
🍞 Anchor: On DocBench, DocDancer with strong backbones tops many baselines and even meets or exceeds reported human-level performance in some settings.
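A hedged sketch of how an LLM-as-Judge score could be collected; the prompt wording and the `call_llm` helper are assumptions, not the benchmarks' official judging setup.

```python
# Illustrative LLM-as-Judge scoring; `call_llm` is a hypothetical helper and the
# prompt is not the benchmarks' official template.

JUDGE_PROMPT = """You are grading a document question-answering system.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Reply with exactly "correct" or "incorrect"."""

def judge(question: str, reference: str, prediction: str, call_llm) -> int:
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, prediction=prediction))
    return 1 if verdict.strip().lower().startswith("correct") else 0

# An LasJ-style score is then the mean of these 0/1 judgments over the benchmark.
```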
The test setup:
- Models: Open-source Qwen3-4B and Qwen3-30B-A3B fine-tuned on 5,000 synthetic DocDancer trajectories; also tested with proprietary LLMs as an upper bound.
- Tasks: Cross-page, multimodal DocQA with text, tables, charts, and screenshots.
- Evaluation: Official evaluation kits; LLM-as-Judge to reduce extraction-format bias.
The competition (who’s on the field):
- VLM-only: feed page images to a vision-language model.
- OCR-based: strip text and ask a language model.
- RAG-based: retrieve chunks; generate answers.
- Prompt-based agents: tool-augmented but not trained end-to-end; often closed-source backbones.
Scoreboard with context:
- On MMLongBench-Doc, DocDancer (Qwen3-30B-A3B ft) achieves about 54.4 Accuracy / 53.9 F1 / 65.3 LasJ, outperforming many baselines and prompt agents.
- On DocBench, the same open-source DocDancer reaches about 81.2 LasJ; stronger proprietary backbones push even higher (e.g., 85.5), matching or exceeding reported human-level results in some setups.
- Even the small 4B variant (trained on the same 5,000 examples) performs strongly (e.g., near 80 LasJ on DocBench), showing data efficiency and effective behavior learning.
- Domain-wise gains are largest where structure is complex (Finance, Reports), highlighting cross-page, cross-modal strengths.
Surprising/insightful findings:
- Minimal tools win: Two well-designed tools (Search, Read) beat more complicated toolkits in ablations.
- Better parsing matters: Using a stronger parser to build the outline consistently improves results under the same tools.
- Synthetic beats OS-QA (existing human-written QA): Training on the Exploration-then-Synthesis data outperforms training on the human-written QA available for the same PDFs, across both benchmarks.
- Reader model swap: Changing the Read tool’s multimodal summarizer to a stronger external model gives only modest gains, suggesting the tool design—more than sheer backbone power—is key.
Why this is meaningful:
- Compared to OCR-only (like reading with your eyes closed to charts) or single-shot RAG (one quick guess), a trained agent that iterates with Search/Read is more reliable on the messy reality of long PDFs.
- Matching or surpassing human-level results in some DocBench settings with a simple toolset points to a robust, general approach, not just overfitting tricks.
05 Discussion & Limitations
🍞 Hook: Even the smartest backpack has limits—it can’t carry a house.
🥬 The Concept: Limitations and tradeoffs explain where DocDancer shines and where caution is needed. How it works:
- Model scope: Experiments focus on specific open-source families (Qwen3-4B, Qwen3-30B-A3B); results may vary for other backbones.
- Training style: Uses supervised fine-tuning; agentic reinforcement learning might further boost planning but wasn’t explored.
- Data scale: Only 5,000 synthetic trajectories were used; more or broader data could yield additional gains—and new failure modes to address.
- Tool dependence: Although the toolset is minimal and strong, success assumes access to a good parser and stable Search/Read infrastructure.
Why it matters: Knowing boundaries helps teams deploy wisely and prioritize future work.
🍞 Anchor: If a deployment lacks high-quality parsing (bad outlines, missing captions), Search may mislead and Read may miss visuals, reducing accuracy.
Required resources:
- A reliable parser for building outlines (like MinerU2.5 or similar-quality alternatives).
- Compute to fine-tune large-context models and to run the Read tool’s multimodal summarizer.
- Document storage and a fast search index.
When not to use:
- Extremely short, plain-text documents where a simple text reader suffices; the agent loop may be overkill.
- Ultra-tight latency settings that can’t afford iterative Search→Read steps.
- Scenarios requiring domain-specific tools (e.g., OCR of handwritten forms) beyond the two-tool design without customization.
Open questions:
- How far can end-to-end agents improve with agentic RL or curriculum learning on progressively harder synthetic tasks?
- What are the best practices for scaling synthetic data—more documents, more diverse domains, or more complex multi-hop patterns?
- Can we further compress or distill the agent to smaller backbones without sacrificing multi-modal, cross-page reasoning?
- How to enhance faithfulness checks and reduce subtle grounding errors when charts and tables partially disagree or use tricky footnotes?
06 Conclusion & Future Work
Three-sentence summary:
- DocDancer reframes DocQA as an information-seeking mission and trains an end-to-end agent to solve it with just two tools—Search for global discovery and Read for local, multimodal comprehension.
- A new Exploration-then-Synthesis pipeline manufactures high-quality, grounded, multi-step Q&A data so the agent learns real strategies, not brittle prompts.
- With only 5,000 synthetic examples, DocDancer achieves strong, often state-of-the-art results on long-document benchmarks, offering a simple, robust blueprint for document-grounded agents.
Main achievement: Showing that a minimal, tool-driven design—paired with targeted synthetic training—can reliably handle cross-page, cross-modal reasoning in long, real-world PDFs and rival or exceed closed, prompt-heavy systems.
Future directions:
- Add agentic reinforcement learning and curricula to strengthen planning, verification, and recovery from errors.
- Scale synthetic data across more domains and reasoning patterns; probe transfer and robustness.
- Explore smaller backbones via distillation, and larger-context models for massive documents.
- Tighten faithfulness checks and calibration, especially for numeric and footnote-constrained tasks.
Why remember this: DocDancer proves you don’t need a toolbox full of hammers—just a smart map (Search), a sharp magnifying glass (Read), and the right practice. With that trio, document agents can dance through long, visual PDFs and consistently land on precise, well-grounded answers.
Practical Applications
- Financial analysis: Extract key metrics across text and tables to compute ratios, trends, and cross-validated summaries.
- Legal review: Locate clauses, exceptions, and footnotes across sections and verify consistency with exhibits and tables.
- Enterprise reporting: Answer multi-page questions from compliance, risk, or audit reports while citing evidence.
- Technical manuals: Combine procedure text with diagrams to guide troubleshooting and confirm part specifications.
- Scientific literature: Cross-check narrative claims with figures and tables for meta-analysis or peer review support.
- Customer support: Resolve complex “how-to” queries from user guides by combining steps, warnings, and diagrams.
- Procurement and contracts: Compare terms across appendices and tables, highlighting differences and dependencies.
- Education: Generate grounded homework/exam questions that require multi-step reasoning over textbook chapters.
- Healthcare administration: Pull policy details and tabular criteria from long guidelines for eligibility checks.
- Policy analysis: Verify policy statements against charts and statistical tables across multi-agency documents.