SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature
Key Summary
- •The paper introduces SIN-Bench, a new way to test AI models that read long scientific papers by forcing them to show exactly where their answers come from.
- •Instead of hiding a fake 'needle' in random text, the authors use real papers where text and figures are interleaved, like how humans actually read science.
- •The core idea is Fish-in-the-Ocean (FITO): models must catch multiple 'fish' (real pieces of evidence) across the paper and connect them into an evidence chain before answering.
- •They build SIN-Data (clean, interleaved papers) and four tasks in SIN-Bench: find evidence (SIN-Find), check if evidence really supports a claim (SIN-Verify), answer with proof (SIN-QA), and summarize with citations (SIN-Summary).
- •Scoring follows 'No Evidence, No Score': correct answers without verifiable anchors don’t count much.
- •Across eight multimodal models, Gemini-3-pro has the best overall grounded score (0.566), while GPT-5 gets the best raw answer accuracy on SIN-QA (0.767) but loses points when evidence is weak.
- •Interleaving figures next to the text that cites them boosts performance a lot, showing that layout matters for real understanding.
- •Hard negative tests (tricky, near-miss evidence) drop verification accuracy for all models, proving that true logic checking is still very hard.
- •Open-weight models often fail to output well-structured evidence chains, showing that formatting and grounding are big challenges.
- •This benchmark nudges the field from 'guessing the right answer' to 'proving it with traceable, cross-modal evidence'.
Why This Research Matters
When AI reads scientific papers for us, we need more than confident answers—we need traceable proof from the paper itself. SIN-Bench encourages models to point to exact text and figures, making their reasoning visible and checkable. This improves trust for doctors, engineers, and researchers who rely on precise evidence. It also discourages hallucinations, since guesses without anchors won’t score well. Over time, training against this benchmark can make AI better at careful reading, honest citation, and disciplined logic. That’s key for safe AI assistance in science, medicine, and policy. Ultimately, it helps move AI from sounding smart to being verifiably correct.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how when you do a science project, you can’t just say the result—you have to show your data, your steps, and your graphs? Teachers want proof, not guesses.
🥬 The Concept: Multimodal Large Language Models (MLLMs)
- What it is: MLLMs are AIs that can read and connect words, pictures, tables, and charts in one big story.
- How it works:
- Read long text.
- Look at figures and captions.
- Match text with visuals.
- Answer questions or write summaries.
- Why it matters: Without linking words to pictures, the AI can miss crucial facts hidden in figures or misread what the paper really shows. 🍞 Anchor: Imagine explaining a volcano paper: the AI must read the paragraph about lava flow and also interpret the graph showing temperature changes.
🍞 Hook: Imagine trying to remember a whole chapter and a bunch of diagrams at once. That’s hard even for people, and it’s also tough for AI.
🥬 The Concept: Long-context understanding
- What it is: Handling long documents where the important clues are spread out over many pages and images.
- How it works:
- Keep track of far-apart details.
- Connect methods to results and conclusions.
- Don’t forget earlier parts when reading later parts.
- Why it matters: If the AI forgets earlier details, it can answer wrongly or miss the logic chain. 🍞 Anchor: In a paper, the method might be on page 2 and the key result graph on page 9—they must be connected for a correct answer.
🍞 Hook: You know how in a textbook the figure is often right next to the paragraph that explains it? That makes learning easier.
🥬 The Concept: Interleaved scientific documents
- What it is: Papers where text and figures are placed in the order you’d read and use them together.
- How it works:
- Put each figure just before the paragraph that cites it.
- Keep the natural story flow (Abstract → Methods → Results → Discussion).
- Preserve references so you can jump from claim to evidence.
- Why it matters: If pictures are far from their explanations, the AI (and you!) can lose the thread. 🍞 Anchor: A paragraph says “Figure 3 shows…” and Figure 3 is right there—no hunting around.
🍞 Hook: Picture a game where someone hides a secret word in a giant pile of random pages. Finding it is a search trick, not real understanding.
🥬 The Concept: Needle-in-a-Haystack (NIAH) tests
- What it is: Old-style tests that hide a fake fact (the needle) inside lots of unrelated text (the haystack) to check retrieval.
- How it works:
- Insert synthetic “needle” sentences.
- Ask the model to find them.
- Score based on whether it retrieves the needle.
- Why it matters: It measures memory/search, but not deep reasoning or linking real evidence. 🍞 Anchor: Finding a random sentence isn’t the same as explaining a real figure from a real paper.
🍞 Hook: Now imagine swimming in the real ocean: there’s no fake needle—just real fish you must catch and connect into a meal.
🥬 The Concept: Fish-in-the-Ocean (FITO) paradigm
- What it is: A new way to test AI by having it find and connect real evidence (“fish”) across a real paper (“ocean”).
- How it works:
- Search the paper’s text and figures for relevant pieces.
- Chain them into a step-by-step trail (an evidence chain).
- Use the chain to answer, verify, or summarize.
- Why it matters: Without this chain, the AI might guess correctly but for the wrong reasons. 🍞 Anchor: To answer “Does the new method really help?”, the AI must cite the method section and the result figure that proves it.
🍞 Hook: If you tell your teacher “I got 92!”, they’ll say, “Show me your worksheet.”
🥬 The Concept: No Evidence, No Score
- What it is: A rule that says answers only count when they’re backed by traceable evidence from the paper.
- How it works:
- The model must point to exact text spans and figures.
- Judges check if those anchors are relevant and ordered logically.
- If there’s no valid evidence, the score is near zero even if the answer sounds right.
- Why it matters: It prevents lucky guesses and pushes real understanding. 🍞 Anchor: Saying “The accuracy is better” gets no credit unless you show the specific table or plot that proves it.
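To make the rule concrete, here is a minimal Python sketch (our own illustration, not the paper's scorer) of gating an answer's credit on the presence of a verifiable anchor; the real judging also checks relevance and ordering, so treat the names and logic below as assumptions.

```python
# Minimal sketch of "No Evidence, No Score": the answer keeps its score only when
# at least one cited anchor can be traced back to the document. Names are illustrative.
from dataclasses import dataclass


@dataclass
class EvidenceUnit:
    figure_id: str   # interleaved tag such as "x3"
    text_span: str   # span quoted from the paper


def grounded_score(answer_score: float, chain: list[EvidenceUnit],
                   valid_anchor_ids: set[str]) -> float:
    """Zero out the credit when no cited anchor is verifiable in the document."""
    has_valid_anchor = any(
        unit.figure_id in valid_anchor_ids and unit.text_span.strip()
        for unit in chain
    )
    return answer_score if has_valid_anchor else 0.0


# A fluent but unanchored answer earns nothing; an anchored one keeps its score.
print(grounded_score(1.0, [], {"x1", "x2"}))                                       # 0.0
print(grounded_score(1.0, [EvidenceUnit("x2", "Table 2 reports 91.4%")], {"x1"}))  # 0.0 (wrong anchor)
print(grounded_score(1.0, [EvidenceUnit("x1", "Table 2 reports 91.4%")], {"x1"}))  # 1.0
```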
🍞 Hook: Imagine reorganizing your notes so every graph sits next to the paragraph that mentions it—way easier to study.
🥬 The Concept: SIN-Data
- What it is: A cleaned, interleaved collection of real scientific papers that preserves native text–figure order.
- How it works:
- Parse PDFs/LaTeX/XML into a unified format.
- Insert each figure right before its first citation.
- Filter for quality so anchors are precise and recoverable.
- Why it matters: Good data layout makes fair, realistic testing possible. 🍞 Anchor: Instead of a messy PDF, you get a tidy Markdown where “Figure x1” appears exactly where it’s discussed.
🍞 Hook: Think of a science fair rubric that checks not just your conclusion, but also whether your data actually supports it.
🥬 The Concept: SIN-Bench
- What it is: A four-part test suite that checks discovery, verification, grounded Q&A, and evidence-anchored summarization.
- How it works:
- SIN-Find: locate evidence.
- SIN-Verify: decide if evidence truly supports the claim.
- SIN-QA: answer with an evidence chain.
- SIN-Summary: summarize with per-claim anchors.
- Why it matters: It mirrors how scientists actually read and reason—find, check, answer, and synthesize. 🍞 Anchor: A model that can pass all four steps is closer to how a PhD student reads a paper.
The world before this work mostly rewarded answer-only scores and synthetic retrieval tricks. Those fail when the real job is linking methods to results and explaining figures inside long, native documents. The gap was a fair way to test whether models can show their work. This paper fills that gap with FITO, SIN-Data, and SIN-Bench—plus a strict “No Evidence, No Score” rule—so AI must prove its thinking just like a good scientist.
02 Core Idea
🍞 Hook: You know how in math class the teacher says, “Don’t just give me the answer—show how you got it”? That way they know you truly understand.
🥬 The Concept: The Aha! Moment
- What it is: Evaluate answers and their proof together inside the same document, so the model must build an evidence chain before getting credit.
- How it works (three analogies):
- Detective analogy: The model can’t just name the suspect; it must show each clue (text span or figure) and how clues connect.
- Cooking analogy: The model can’t serve a cake without listing the exact ingredients (anchored evidence) in the right order.
- Museum guide analogy: The model must point at each painting (figure) and quote the label (text) as it explains the exhibition’s story.
- Why it matters: Without the trail, a model can “sound right” while actually guessing or relying on memory. 🍞 Anchor: A correct claim like “Model A beats Model B by 3%” only counts when the model cites the exact result table and the comparison setup.
🍞 Hook: Imagine upgrading a quiz from “multiple choice” to “explain your reasoning with page numbers and pictures.”
🥬 The Concept: Before vs After
- What it is: What changed because of this idea.
- How it works:
- Before: Tests favored finding planted facts or grading final answers.
- After: Tests demand native, cross-modal evidence chains that match the paper’s logic.
- Outcome: We can tell if the model truly understands, not just memorizes.
- Why it matters: This closes the gap between “sounds good” and “proven by the paper.” 🍞 Anchor: GPT-5 scored high on raw answer accuracy but lost points when its cited evidence didn’t really back the answer—showing the new test reveals hidden weaknesses.
🍞 Hook: Think of reasoning like building a bridge—you need the right pieces in the right order, or the bridge falls.
🥬 The Concept: Why it works (intuition)
- What it is: The evaluation ties the answer to a grounded process.
- How it works:
- First find candidate anchors (text spans and figures).
- Check each anchor matches the claim.
- Ensure the chain’s order makes logical sense from premise to conclusion.
- Why it matters: This step-by-step guardrail stops unsupported leaps and encourages careful reading. 🍞 Anchor: If a model claims “bigger batch size improved results” but its evidence chain never mentions batch size, it gets downgraded.
🍞 Hook: Imagine a Lego set with five bags: if you build bag-by-bag in the right order, your castle stands strong.
🥬 The Concept: Building blocks of the idea
- What it is: The small pieces that make the big idea work.
- How it works:
- FITO paradigm: hunt real evidence in native papers.
- SIN-Data: format papers so text and figures are interleaved by first citation.
- Four tasks: Find, Verify, QA-with-proof, and Summarize-with-anchors.
- Metrics (MRL): Matching, Relevance (F1), and Logic (order similarity).
- Rule: No Evidence, No Score.
- Why it matters: Each piece stops a different failure—missing anchors, cherry-picked snippets, or scrambled logic. 🍞 Anchor: Together, these blocks catch errors like “shotgun citations” (throwing many irrelevant figures) or skipping key premises.
Overall, the key insight is simple but powerful: make models prove their claims with native, cross-modal evidence chains, and score them on that process—not just the final answer.
03 Methodology
At a high level: Input document → Interleaved parsing (SIN-Data) → Pick a task (Find/Verify/QA/Summary) → Model outputs answer plus evidence chain (or decision) → MRL scoring under “No Evidence, No Score.”
🍞 Hook: Think of building a science binder: first organize notes and graphs, then answer workbook questions, always citing the exact page.
🥬 The Concept: SIN-Data pipeline (how papers become study-ready)
- What it is: Turning raw papers into a clean, interleaved format.
- How it works:
- Element parsing: Convert LaTeX/XML into structured text; extract figures and tables.
- Semantic-first formatting: Insert each figure before its first in-text citation and tag it (x1, x2, …).
- Quality filtering: Remove broken links or low-visual-density papers; balance across disciplines.
- Why it matters: If figures and text don’t line up, later evidence checks collapse. 🍞 Anchor: A physics paper becomes Markdown where “Figure x3” sits right where “Figure 3 shows…” appears, so the model can cite x3 precisely.
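As a rough illustration of the "insert before first citation" rule, here is a simplified sketch. The real pipeline parses LaTeX/XML structure and handles tables and captions; the regex-based function and its names below are our assumptions, not the authors' code.

```python
# Simplified sketch of interleave-by-first-citation: each figure placeholder is inserted
# just before the first paragraph that cites it. Real SIN-Data parsing works on
# LaTeX/XML, not regexes; this only illustrates the placement rule.
import re


def interleave_figures(paragraphs: list[str], figures: dict[int, str]) -> list[str]:
    """figures maps figure number -> markdown block (image + caption)."""
    placed: set[int] = set()
    out: list[str] = []
    for para in paragraphs:
        cited = sorted(int(n) for n in re.findall(r"Figure\s+(\d+)", para))
        for num in cited:
            if num in figures and num not in placed:
                out.append(f"[Figure x{num}]\n{figures[num]}")  # tag later used as anchor
                placed.add(num)
        out.append(para)
    # Append any never-cited figures at the end so nothing is silently dropped.
    out.extend(f"[Figure x{n}]\n{figures[n]}" for n in figures if n not in placed)
    return out


doc = interleave_figures(
    ["We describe the training setup.", "Figure 1 shows the accuracy curve.", "We conclude."],
    {1: "![accuracy curve](fig1.png) *Caption: accuracy vs. epochs*"},
)
print("\n\n".join(doc))
```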
🍞 Hook: Imagine four mini-games that together feel like how researchers really read a paper.
🥬 The Concept: The four SIN-Bench tasks
- What it is: A progressive workflow—Discovery → Verification → QA → Synthesis.
- How it works:
- SIN-Find (evidence discovery): Given a question, output an ordered chain of anchors [x1, text span, x2, text span, …].
- Why needed: Without finding the right anchors in order, later reasoning is shaky.
- Example: “Which dataset setting caused the accuracy jump?” → cite the method paragraph and the result figure in sequence.
- SIN-Verify (hypothesis verification): Given (Document, Question, Answer, Evidence), judge if evidence truly supports the answer (yes/no).
- Why needed: Stops persuasive but insufficient evidence.
- Example: If a claim compares models but the figure shows a different dataset, the decision should be “no.”
- SIN-QA (grounded reasoning): Output both the answer and its evidence chain jointly.
- Why needed: Forces the model to reason with traceable support.
- Example: “Does the new loss function reduce overfitting?” → provide conclusion and cite the regularization section plus a generalization plot.
- SIN-Summary (evidence-anchored synthesis): Produce a multi-claim summary where each claim has its own anchors.
- Why needed: Tests big-picture integration across the whole paper.
- Example: Bullet 1 (method) + x2; Bullet 2 (main result) + x5; Bullet 3 (limitation) + x7.
- Why it matters: These tasks mirror how scientists actually operate. 🍞 Anchor: A strong model can find the right figures, prove support, answer carefully, and summarize with citations.
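For readers who think in data structures, here is one hypothetical way to represent the four task types around a shared evidence-chain object; the field names are illustrative and do not claim to match the official release format.

```python
# Hypothetical item schemas for the four tasks, built around one shared anchor type.
# Field names are illustrative, not the benchmark's official format.
from dataclasses import dataclass, field


@dataclass
class Anchor:
    figure_id: str   # interleaved tag such as "x5"
    text_span: str   # exact span quoted from the paper


@dataclass
class FindItem:          # SIN-Find: question -> ordered chain of anchors
    question: str
    gold_chain: list[Anchor]


@dataclass
class VerifyItem:        # SIN-Verify: does the given evidence support the answer?
    question: str
    answer: str
    evidence: list[Anchor]
    label: bool          # False covers both easy and hard negatives


@dataclass
class QAItem:            # SIN-QA: answer and evidence chain are produced jointly
    question: str
    gold_answer: str
    gold_chain: list[Anchor]


@dataclass
class SummaryItem:       # SIN-Summary: one anchor set per claim
    claims: list[str] = field(default_factory=list)
    claim_anchors: list[list[Anchor]] = field(default_factory=list)
```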
🍞 Hook: Grading time! But not just “right or wrong”—we also check if the proof matches, matters, and makes sense in order.
🥬 The Concept: MRL metrics under No Evidence, No Score
- What it is: Three lenses for judging evidence chains—Matching, Relevance, and Logic—plus answer/verify scoring by task.
- How it works:
- Matching (M): For each cited figure, check if the paired text span says what the ground-truth span says (LLM judge gives 0–3 scaled to 0–1).
- Relevance (R via F1): Count true-positive evidence units (right figure + good text); balance precision vs recall.
- Logic (L via Kendall–Tau): Compare the order of matched anchors to the gold order (0–1 similarity).
- SIN-QA adds Answer Accuracy (LLM-graded 0–1); SIN-Verify uses standard accuracy.
- Why it matters: If Matching is low, the text claim doesn’t really align; if Relevance is low, you’re citing extra junk or missing must-haves; if Logic is low, your steps are scrambled. 🍞 Anchor: A chain like [x2, caption; x5, result paragraph] that’s out of order drops Logic even if both pieces are relevant.
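Here is a minimal sketch of the Relevance and Logic computations on one predicted chain. The Matching score requires an LLM judge and is not reproduced; details such as how very short chains are handled and how Kendall tau is mapped onto [0, 1] are our assumptions.

```python
# Sketch of Relevance (F1 over evidence units) and Logic (order similarity of matched
# anchors). Matching needs an LLM judge and is omitted here. The handling of short
# chains and the tau -> [0, 1] mapping are assumptions.
from itertools import combinations


def relevance_f1(pred_ids: list[str], gold_ids: list[str]) -> float:
    tp = len(set(pred_ids) & set(gold_ids))
    if tp == 0 or not pred_ids or not gold_ids:
        return 0.0
    precision, recall = tp / len(set(pred_ids)), tp / len(set(gold_ids))
    return 2 * precision * recall / (precision + recall)


def logic_score(pred_ids: list[str], gold_ids: list[str]) -> float:
    """Kendall-tau order similarity over anchors appearing in both chains."""
    shared = [a for a in pred_ids if a in gold_ids]      # in predicted order
    if len(shared) < 2:
        return 1.0 if shared else 0.0
    gold_rank = {a: i for i, a in enumerate(gold_ids)}
    pairs = list(combinations(shared, 2))                # pairs keep predicted order
    concordant = sum(gold_rank[a] < gold_rank[b] for a, b in pairs)
    tau = (2 * concordant - len(pairs)) / len(pairs)     # in [-1, 1]
    return (tau + 1) / 2                                 # mapped to [0, 1]


pred, gold = ["x5", "x2", "x9"], ["x2", "x3", "x5"]
print(round(relevance_f1(pred, gold), 3))  # 0.667: two of the three cited units are right
print(logic_score(pred, gold))             # 0.0: the two matched anchors are out of order
```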
🍞 Hook: Sometimes wrong answers look fancy. Let’s make near-misses hard to pass.
🥬 The Concept: Easy vs hard negatives in verification
- What it is: Two kinds of trick questions for SIN-Verify.
- How it works:
- Easy negatives: Obviously irrelevant evidence; models should reject them.
- Hard negatives: Near-miss evidence that seems related but is insufficient or misaligned (e.g., wrong setting, missing premise).
- Why it matters: Only hard negatives reveal whether the model truly checks sufficiency, not just surface overlap. 🍞 Anchor: If the claim is about “test set” but the figure shows “validation set,” a good verifier says “no support.”
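One way to see why the split matters is to report verification accuracy per subset rather than as a single number. The sketch below assumes illustrative item fields ('label', 'difficulty') rather than the dataset's actual schema.

```python
# Report verification accuracy separately for positives, easy negatives, and hard
# negatives: a verifier that only spots obvious mismatches looks perfect on easy
# negatives and collapses on hard ones. Item fields here are illustrative.
def verify_accuracy_by_split(items: list[dict], predict) -> dict[str, float]:
    splits: dict[str, list[bool]] = {}
    for item in items:
        key = "positive" if item["label"] else f"{item['difficulty']}_negative"
        splits.setdefault(key, []).append(predict(item) == item["label"])
    return {key: sum(hits) / len(hits) for key, hits in splits.items()}


# Usage with a hypothetical model wrapper that returns True when it judges "supported":
# accuracies = verify_accuracy_by_split(verify_items, lambda item: model_supports(item))
```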
🍞 Hook: Think of coaches and referees working together—multiple viewpoints catch more mistakes.
🥬 The Concept: Building the benchmark (multi-MLLM + humans)
- What it is: A scalable, careful pipeline to create reliable test items.
- How it works:
- Multi-MLLM synthesis: Strong models co-generate questions, answers, and draft evidence chains.
- Cross-validation: A three-model jury scores question quality, answer correctness, and evidence consistency (1–5); keep only high scorers.
- Human-in-the-loop audit: Graduate students confirm anchors are precise and support is rigorous; fix any issues.
- Why it matters: This keeps the data challenging but fair, and cuts hallucinations. 🍞 Anchor: If two models disagree about whether x4 truly supports a claim, a human reviewer settles it before release.
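As a hedged sketch of the cross-validation gate, the helper below averages the three judge models' 1-5 scores per criterion and keeps only items that clear a threshold before the human audit; the threshold value and field names are assumptions made for illustration.

```python
# Sketch of the jury filter: each of three judge models scores a candidate item on
# question quality, answer correctness, and evidence consistency (1-5). The item
# advances to human audit only if every criterion's average clears the threshold.
from statistics import mean


def passes_jury(scores: list[dict[str, int]], threshold: float = 4.0) -> bool:
    """scores: one dict per judge, e.g. {'question': 5, 'answer': 4, 'evidence': 4}."""
    for criterion in ("question", "answer", "evidence"):
        if mean(judge[criterion] for judge in scores) < threshold:
            return False
    return True


candidate_scores = [
    {"question": 5, "answer": 4, "evidence": 4},
    {"question": 4, "answer": 5, "evidence": 3},
    {"question": 5, "answer": 4, "evidence": 4},
]
print(passes_jury(candidate_scores))  # False: evidence averages 3.67, filtered before audit
```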
Secret sauce highlights:
- Interleaved-by-first-citation formatting aligns evidence with its narrative moment.
- Jointly generating answers and evidence in SIN-QA stabilizes reasoning (acts like a light chain-of-thought).
- “No Evidence, No Score” ensures answer-only guesses can’t win, shifting training and prompting behaviors toward grounding.
04 Experiments & Results
🍞 Hook: Imagine a science contest where judges don’t just look at your ribbon—they flip through your lab notebook to see if your data backs it up.
🥬 The Concept: The test and the scoreboard
- What it is: Eight multimodal models are tested on four tasks, scored by evidence quality (MRL), answer correctness, and verification accuracy.
- How it works:
- Tasks: SIN-Find, SIN-Verify, SIN-QA, SIN-Summary.
- Metrics: Matching, Relevance (F1), Logic (Kendall–Tau), Answer Accuracy (LLM-graded), Verify Accuracy.
- Overall: Average the metrics per task; then average across tasks.
- Why it matters: It reveals whether models can both be right and prove it. 🍞 Anchor: A model that says “better accuracy” but cites the wrong figure loses Matching and Relevance even if the words sound fine.
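The aggregation itself is simple; a small sketch under the stated scheme (average the metrics within each task, then average across tasks) is shown below, with placeholder numbers rather than results from the paper.

```python
# Sketch of the scoreboard aggregation: per-task metric averages, then a mean across
# tasks. Which metrics enter each task's average follows the description above; treat
# this as an approximation, not the official scoring script.
def overall_score(per_task_metrics: dict[str, dict[str, float]]) -> float:
    task_means = [sum(m.values()) / len(m) for m in per_task_metrics.values()]
    return sum(task_means) / len(task_means)


# Illustrative placeholder numbers, not results reported in the paper.
example = {
    "SIN-Find":    {"matching": 0.60, "relevance_f1": 0.55, "logic": 0.70},
    "SIN-Verify":  {"accuracy": 0.62},
    "SIN-QA":      {"matching": 0.58, "relevance_f1": 0.50, "logic": 0.65, "answer": 0.73},
    "SIN-Summary": {"matching": 0.45, "relevance_f1": 0.40, "logic": 0.55},
}
print(round(overall_score(example), 3))  # 0.58
```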
Results with context:
- Gemini-3-pro gets the best average overall grounded score (about 0.566). That’s like getting a solid A- when others average a B.
- GPT-5 scores the best raw answer accuracy on SIN-QA (0.767) but loses more points when its answers are matched against solid evidence, so its overall grounded scores on SIN-Find and SIN-QA trail Gemini-3-pro's.
- Claude-sonnet-4.5 is precise at finding anchors in SIN-Find (strong overall there), showing good evidence-spotting skills.
- Open-weight Qwen3-VL-8B outperforms its larger 30B MoE cousin on several metrics, hinting that careful fine-tuning beats size alone for this kind of reasoning.
- Several open-weight models struggle to output the structured evidence format, so their chains cannot be parsed and are scored as invalid; formatting and grounding are non-trivial challenges.
🍞 Hook: Layout matters—put the picture next to the paragraph and the puzzle pieces snap into place.
🥬 The Concept: Interleaving boosts performance
- What it is: Keeping figures next to their first-citing text helps models.
- How it works:
- Compare “interleaved” vs “non-interleaved (images then text).”
- Performance jumps notably for SIN-QA (+0.102) and SIN-Summary (+0.129) with interleaving on Gemini-3-pro.
- Text-only captions help a bit, but lose fine details; image-only pages help least without nearby text.
- Why it matters: Realistic, native layout supports real comprehension. 🍞 Anchor: When “Figure x5” sits by “Figure 5 shows…,” the model is less likely to cite the wrong plot.
🍞 Hook: Show your work, get better grades.
🥬 The Concept: Evidence-chain requirement improves answers
- What it is: Forcing the model to output chains raises its own answer accuracy.
- How it works:
- Compare SIN-QA with and without required evidence chains.
- Gemini-3-pro improves from 0.694 to 0.726 on answer accuracy when chains are required.
- Why it matters: Building the chain guides thinking—like a study outline that prevents off-topic guesses. 🍞 Anchor: Asking “Which figure proves this?” nudges the model to double-check before answering.
🍞 Hook: Some trick questions look right—but the proof doesn’t actually fit.
🥬 The Concept: Hard negatives expose shallow verification
- What it is: Near-miss evidence is tough.
- How it works:
- Easy negatives: All models near 1.0 accuracy (clear mismatches).
- Hard negatives: Accuracy drops sharply (e.g., GPT-5 from 1.0 to ~0.208), showing limited skill at detecting insufficient-but-related evidence.
- Why it matters: True auditing means catching missing premises and wrong conditions, not just spotting irrelevant pages. 🍞 Anchor: If a claim is about “final test accuracy” but evidence shows “validation accuracy,” good verifiers must say “no support.”
Other observations:
- Length stability: Strong models handle very long text reasonably well, but mixing in many similar figures can raise variance—more chances to cite the wrong one.
- Domain differences: Models tend to do better in Economics/Medicine and worse in Mathematics, suggesting symbolic precision is still hard.
Bottom line: This benchmark turns hidden weaknesses visible—especially the gap between sounding correct and being provably correct inside the document.
05 Discussion & Limitations
🍞 Hook: Even great students have areas to improve; knowing them helps you study smarter.
🥬 The Concept: Honest limits and when to be careful
- What it is: Where this benchmark—and today’s models—still struggle.
- How it works:
- Limitations:
- Model IO limits: Not all MLLMs accept very long, interleaved inputs; this restricts who can be tested.
- Strict filtering: Some useful but imperfect papers are excluded, trading scale for cleanliness.
- LLM judge reliance: Although validated, automatic judges still aren’t humans and may misgrade tricky edge cases.
- Required resources:
- Interleaved parsing pipeline and storage for long documents with images.
- Access to strong MLLMs for synthesis and judging.
- Human reviewers (graduate level) for spot checks and final audits.
- When not to use:
- If your domain has minimal figures/tables and short texts—simpler QA datasets might suffice.
- If you only need surface retrieval stress-tests—classic NIAH may be enough.
- If you can’t provide or parse stable anchors (e.g., scanned documents with unreadable figures).
- Open questions:
- Can we automate anchor extraction and logic checking further without losing reliability?
- How to better detect and penalize “shotgun citations” automatically?
- Can models learn to self-correct chains (repair missing premises) on the fly?
- How to scale domain coverage to math-heavy proofs where visual cues are sparse?
- Why it matters: Clear boundaries guide future improvements and fair use. 🍞 Anchor: If your dataset is mostly equations without figures, you may need a math-focused extension before using SIN-Bench as your main test.
06 Conclusion & Future Work
Three-sentence summary: This paper introduces the Fish-in-the-Ocean paradigm and SIN-Bench to evaluate whether AI can read long, multimodal scientific papers and prove its answers with native evidence chains. It builds SIN-Data to keep text and figures interleaved, defines four tasks from discovery to synthesis, and scores with Matching, Relevance, and Logic under a strict No Evidence, No Score rule. Experiments show a clear gap between answer-only correctness and truly grounded reasoning, pushing the field toward transparent, evidence-backed AI.
Main achievement: Turning evaluation from “Did you answer?” into “Can you show and connect the exact evidence inside the paper that proves your answer?”—and making that process measurable and comparable across models.
Future directions:
- Teach models to repair weak chains (auto-add missing premises, fix order) before finalizing answers.
- Expand to math-dense domains and richer table reasoning with cell-level anchors.
- Build training curricula that reward chain quality directly, not just final answers.
- Develop stronger, standardized judges that align even closer to human experts while resisting adversarial tricks.
Why remember this: SIN-Bench changes the rules in a healthy way—AIs don’t just need to be right; they need to be right for the right reasons, with receipts (anchors) inside real papers. That’s the kind of skill scientists—and trustworthy AI assistants—actually need.
Practical Applications
- •Build lab-assistant AIs that answer research questions with exact figure and paragraph citations.
- •Create peer-review helpers that flag claims lacking sufficient, properly ordered evidence.
- •Develop study tools for students that summarize papers with cite-as-you-write anchors.
- •Support clinical evidence extraction by linking medical claims directly to trial figures/tables.
- •Power grant/proposal reviewers that verify if claimed gains match documented setups and results.
- •Train explainable QA systems where every answer must include a compact, ordered evidence chain.
- •Detect academic fraud by checking if summaries or claims have verifiable in-document anchors.
- •Fine-tune retrieval-augmented generation to prefer interleaved, native evidence over parametric guesses.
- •Assist data curators in reorganizing PDFs into interleaved formats that preserve reasoning flow.
- •Benchmark enterprise document assistants on grounded synthesis across reports, not just answer-only QA.