T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
Key Summary
- This paper shows that teaching AI to first draw a simple map of a text (nodes and links) before answering questions makes it smarter and more reliable.
- They propose a new prompting style called Structure of Thought (SoT) that tells models to organize key ideas into a small graph, then give the answer.
- They build T2S-Bench, the first big test set that checks how well models can turn text into structure and use that structure for multi-step reasoning.
- T2S-Bench covers 6 science areas, 32 diagram types, and 1.8k carefully checked samples collected from real research papers.
- Across 45 models, average accuracy on multi-hop questions is only about 52% EM, showing lots of room to grow.
- Even top models find node extraction hard in end-to-end structuring (best node accuracy ~58%), though linking is easier (link F1 often >80%).
- Using SoT improves scores across 8 long-text tasks more than regular chain-of-thought, and fine-tuning on T2S-Train adds even more gains.
- Improvements on T2S-Bench correlate with better scores on external long-context benchmarks, suggesting structural skills transfer.
- The paper provides clean evaluation rules, partial constraints, and quality checks so results are fair and comparable.
- Bottom line: making the text structure explicit acts like a universal "map" that helps models Find, Fuse, and Form better answers.
Why This Research Matters
In everyday life, we constantly turn messy information into quick outlines before deciding what to do. This paper gives AI that same habit: make a simple map first, then answer. That change makes search results, summaries, and reports more accurate and easier to check. Doctors, scientists, and policy analysts can inspect the nodes and links to see exactly which evidence was used. Companies can reduce costly mistakes when reading long documents and speed up decision-making. And because structure skills transfer to many tasks, one investment improves many tools people already use.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how, when you study a chapter, you highlight key sentences and draw arrows between ideas so you don't get lost? That little map you make helps you remember and explain the chapter later.
🥬 The Concept: Long-text AI used to read everything like a single, super-long paragraph. Without a simple map in the middle, models struggled to find the right facts, connect them, and explain clearly. This paper argues that an explicit structure-in-the-middle, like your notes, can make a huge difference.
How it worked before:
- Models tried to go straight from long text to the final answer (end-to-end).
- Chain-of-Thought (CoT) helped by writing reasoning steps, but often wandered or added noise in text-heavy tasks.
- Results on long-context tests stayed stuck around middling scores, especially when multiple documents or steps were needed.
What was hard:
- Finding: Pulling exact facts from long documents (you can miss crucial lines).
- Fusing: Combining clues across places (can mix things up).
- Forming: Writing correct final outputs (can hallucinate or be inconsistent).
🍞 Anchor: Imagine answering "Which two systems work together to stabilize blood pressure during exercise?" If you don't first list the pieces (nodes) and how they affect each other (links), it's easy to mix up who controls what. A quick structure, like a mini map, keeps you on track.
🍞 Hook: Picture a LEGO set with no instructions. You might still build something, but it's slow and shaky. If you had a step-by-step diagram, you'd build faster and stronger.
🥬 The Concept: The paper introduces Structure of Thought (SoT), a prompt that tells the model to first outline key nodes (important items) and links (how they relate), then answer. It also introduces T2S-Bench, a big, careful test to measure how well models do this text-to-structure job.
Why earlier attempts fell short:
- Task-specific tricks: Some systems worked only for one kind of input (like tables or SQL), not for general text.
- Ambiguous evaluation: Many texts allow multiple valid structures; scoring fairly is tricky.
- Quality control: Getting reliable text-structure pairs is hard and time-consuming; noisy data hurts learning and testing.
The missing piece (the gap):
- A universal intermediate representation (IR) for text that is simple, reusable across tasks, and verifiable.
- A benchmark that is fair, diverse, and grounded in real diagrams with clear ties to text.
Real stakes (why it matters):
- Search and research: Better literature reviews and evidence-grounded answers save time and reduce mistakes.
- Office work: Stronger summaries, reports, and structured outputs make tools more trustworthy.
- Science and policy: Clearer cause-effect reasoning (e.g., health, climate, economics) helps real decisions.
🍞 Anchor: Think of a school project: find facts (Find), connect them into an outline (Fuse), and write your report (Form). If the outline is missing, your report can drift. A good outline keeps everything tight. That's exactly what SoT and T2S-Bench bring to AI.
02 Core Idea
🍞 Hook: Imagine you're solving a mystery. Before guessing the culprit, you draw a clue map: who met whom, where, and when. The map keeps your thinking sharp.
🥬 The Concept: The key insight is that an explicit structure made from the text, nodes (key ideas) and links (their relationships), is a universal intermediate map that boosts reasoning across many tasks. The paper contributes two parts: a prompting recipe (SoT) that makes models draw this map before answering, and a benchmark (T2S-Bench) that fairly tests and trains this skill across domains.
How it works (high level):
- SoT Prompt: First extract nodes and links from the given text; then answer using that structure.
- T2S-Bench: Use real scientific diagrams and matching text to ask multi-step questions that require structure; also test end-to-end extraction of nodes and links.
- Fine-tune: Train on T2S-Train-1.2k to make models even better at structuring and reasoning.
Why it matters: Without the structure, models can miss steps, mix sources, and produce wobbly answers. With the structure, models act like careful readers who organize ideas before concluding.
🍞 Anchor: When asked, "How does a new tax lead to healthier diets?" the model first lists nodes (tax, price change, purchase choices, diet quality) and links (tax raises prices → people buy fewer sugary drinks → diet quality improves). Then it answers precisely.
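The anchor example above can be written down as a tiny node-link map in code. A minimal sketch in Python; the node names and the single-chain graph shape are illustrative assumptions, not the paper's data format:

```python
# A minimal node-link "map" for the tax example.
# Node names and the chain-following helper are illustrative
# assumptions, not the paper's actual representation.
nodes = ["tax", "price increase", "fewer sugary drinks", "better diet quality"]

# Directed links: cause -> effect.
links = [
    ("tax", "price increase"),
    ("price increase", "fewer sugary drinks"),
    ("fewer sugary drinks", "better diet quality"),
]

def chain_from(start, links):
    """Follow directed links from `start` and return the causal chain."""
    chain = [start]
    lookup = dict(links)  # assumes at most one outgoing link per node
    while chain[-1] in lookup:
        chain.append(lookup[chain[-1]])
    return chain

print(" -> ".join(chain_from("tax", links)))
# tax -> price increase -> fewer sugary drinks -> better diet quality
```

Once the map exists, the final answer is just a walk along it, which is the intuition behind "structure first, answer second."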
Three analogies for the same idea:
- Map analogy: The structure is a road map from facts to conclusions; you stop getting lost.
- Recipe analogy: The structure is a recipe that lists ingredients (nodes) and steps (links); the final dish (answer) is consistent.
- LEGO analogy: The structure is the instruction booklet; the model assembles sturdy builds instead of guesswork.
Before vs. After:
- Before: End-to-end guessing, CoT sometimes noisy for long text; unstable retrieval and generation.
- After: SoT offers a clean, inspectable structure; answers get more accurate and auditable.
Why it works (intuition, not math):
- Attention gets anchored: The model must name important pieces, so it focuses on the right spans.
- Multi-hop made concrete: Links force explicit steps; fewer leaps, more solid reasoning.
- Audit trail: If the answer is wrong, you can check the structure and fix the root cause.
Building blocks (introduced with sandwiches):
- 🍞 Hook: You know how outlines help essays? 🥬 Intermediate Representation (IR): A simple, reusable format (nodes + links) that sits between reading and answering; it stabilizes thinking. Without it, answers wobble. 🍞 Anchor: Like an outline that stops your essay from rambling.
- 🍞 Hook: Ever follow multiple clues in a scavenger hunt? 🥬 Multi-hop Reasoning: Solving requires chaining several nodes via links; SoT makes each hop explicit. Without it, models skip steps. 🍞 Anchor: "A → B → C" is written down, so the treasure (C) is reached logically.
- 🍞 Hook: Think of drawing the whole comic strip, not just one panel. 🥬 End-to-End Structuring: From raw text, list all key nodes and links. Without skill here, you miss pieces and break the story. 🍞 Anchor: When the node list misses a character, the plot falls apart.
Bottom line aha: Make the structure first; the answer gets easier, clearer, and more correct.
03 Methodology
🍞 Hook: Imagine a cooking show. First, the chef lays out all the ingredients and how they connect to the dish. Then they start cooking. No guessing, no chaos.
🥬 The Concept: The method is a recipe with two parts: a prompting trick (SoT) and a benchmark kitchen (T2S-Bench) stocked with clean ingredients (text + structure) and tasting tests (multi-hop questions and end-to-end extraction).
At a high level: Input text → (Step A) Build structure (nodes + links) → (Step B) Reason over structure → Output answer; and for evaluation: Text → Predict structure → Score nodes and links fairly.
Step-by-step details:
- Structure of Thought (SoT) Prompt
- What happens: The prompt has the model output a JSON structure section first (nodes with labels, plus directed links) and only then give the final answer.
- Why this step exists: It anchors attention on the important entities and their relationships; without it, the model can ramble or skip steps.
- Example: From a health article, nodes could be "Tax," "Price Increase," "Reduced Sugary Drinks," "Better Diet Quality." Links show cause → effect. Then the answer explains the chain.
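To make this concrete, here is one hypothetical way such a prompt could be phrased. The wording and JSON schema below are our sketch; the paper's exact template may differ:

```python
# Hypothetical SoT prompt template; the paper's exact wording
# and JSON schema may differ.
SOT_TEMPLATE = """Read the text below. Before answering, output a JSON
object with two keys:
  "nodes": a list of the key entities or concepts,
  "links": a list of [source, target, relation] triples (directed).
Then, on a new line starting with "Answer:", give the final answer
using only that structure.

Text: {text}
Question: {question}"""

def build_sot_prompt(text, question):
    """Fill the template with a passage and a question."""
    return SOT_TEMPLATE.format(text=text, question=question)

prompt = build_sot_prompt(
    "A new tax raises prices, so people buy fewer sugary drinks...",
    "How does the tax affect diet quality?",
)
```

The key design choice is ordering: the structure section comes before the answer, so the model commits to nodes and links first and the answer can be audited against them.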
- Building T2S-Bench: Real, diverse data
- What happens: The team collects structural diagrams and matching text from vetted scientific papers across 6 domains and 32 structure types. Automated tools and large models filter, crop, and check that each figure can be turned into a clean node-link graph and that the text truly describes it. Human experts then quality-check everything.
- Why this step exists: Real diagrams reduce hallucinations and guarantee structural correctness; without this, the dataset would be noisy and unfair.
- Example: A diagram from a physiology paper (homeostatic loop) is converted to nodes/links and matched with its descriptive paragraphs.
- Multi-hop Reasoning (T2S-Bench-MR)
- What happens: They generate multiple-choice questions that require at least two steps across the structure, using four families of reasoning:
- Fault Localization (find where a failure or cause sits),
- Functional Mapping (who aggregates, stores, controls, or mediates),
- Boundary Testing (what holds at the edge cases),
- Counterfactual Reasoning (what changes if you tweak links or nodes).
- Why this step exists: It ensures questions truly need structure; without multi-hop, models could shortcut with keyword matching.
- Example: "If node X is removed, which downstream effect disappears second?" requires following A → B → C carefully.
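That counterfactual question can be mimicked with a small reachability sketch. The graph and the "hop order" interpretation below are illustrative assumptions, not the benchmark's actual question generator:

```python
from collections import deque

def downstream_order(graph, start):
    """Breadth-first list of nodes reachable from `start`, in hop order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order

# Illustrative chain A -> B -> C -> D (not taken from the benchmark).
graph = {"A": ["B"], "B": ["C"], "C": ["D"]}

# If node A is removed, its downstream effects vanish in hop order:
# B first, then C, then D. "Disappears second" is therefore C.
effects = downstream_order(graph, "A")
print(effects[1])  # C
```

Answering correctly requires the explicit link structure; keyword matching over the raw text would not reveal which effect is "second."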
- End-to-End Structuring (T2S-Bench-E2E)
- What happens: Models must extract nodes and links from text with partial constraints for fairness. They score nodes (semantic similarity) and links (F1) separately.
- Why this step exists: Many valid structures can exist; partial constraints standardize scoring. Without this, grading would be unfair.
- Example: Given text and either all nodes (link task) or all links (node task), the model fills in the missing half.
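As a rough sketch of the link half of this scoring, an F1 over sets of directed links might look like the following. Treating links as exact directed pairs is our simplifying assumption; the benchmark's matching rules may be more permissive:

```python
def link_f1(predicted, gold):
    """F1 between two collections of directed (source, target) links.
    Exact-pair matching is a simplifying assumption; the benchmark's
    actual matching rules may differ."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return 0.0
    tp = len(pred_set & gold_set)  # true positives: links in both sets
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("tax", "price"), ("price", "drinks"), ("drinks", "diet")]
pred = [("tax", "price"), ("price", "drinks"), ("tax", "diet")]
print(round(link_f1(pred, gold), 3))  # 0.667
```

Node scoring is harder to sketch faithfully because the paper uses semantic similarity rather than exact string matching, which is exactly why node extraction admits more partial credit and more ambiguity.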
- Training set (T2S-Train-1.2k)
- What happens: A balanced, high-quality set is provided to fine-tune models on structuring skills.
- Why this step exists: SoT prompting helps zero-shot, but fine-tuning boosts generalization. Without training, many models plateau.
- Example: Qwen2.5-7B fine-tuned here shows bigger gains on long-text tasks.
The secret sauce:
- Real diagrams + aligned text: cuts hallucinations and forces true structural grounding.
- Partial-constraint scoring: fair, comparable evaluation of nodes vs. links.
- Template-driven, multi-hop questions: guarantees that structure matters.
- SoT prompt: a simple, universal recipe that composes with other techniques and works across tasks and models.
🍞 Anchor: Think of a science fair. Judges give everyone the same materials (texts), a clear rubric (partial constraints), and tasks that need real understanding (multi-hop). Students who first sketch their plan (SoT) build sturdier projects and explain them better.
04 Experiments & Results
🍞 Hook: When you practice piano with hands separate, then hands together, you improve faster and play cleaner. Testing models on structure first, then reasoning, shows where they struggle, and how to help them.
🥬 The Concept: The team tested 45 models on two fronts: multi-hop reasoning that depends on structure (T2S-Bench-MR) and end-to-end structuring (T2S-Bench-E2E). They also checked whether training on T2S-Train-1.2k and using SoT helps on popular long-text benchmarks.
The tests (what and why):
- Multi-hop QA (EM, F1): Measures if the model can use structure to answer multi-step questions. Why: Multi-hop = real-world reasoning.
- End-to-end structuring (Node similarity, Link F1): Measures if the model can extract nodes and links from text. Why: Without good extraction, reasoning wonāt scale.
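For readers unfamiliar with the QA metrics, EM and token-level F1 are typically computed along these lines. This is a standard sketch; the benchmark's exact normalization (punctuation, articles) may differ:

```python
def exact_match(prediction, reference):
    """1 if the whitespace/case-normalized strings match exactly, else 0."""
    norm = lambda s: " ".join(s.lower().split())
    return int(norm(prediction) == norm(reference))

def token_f1(prediction, reference):
    """Token-overlap F1, the usual softer companion metric to EM."""
    p_tokens = prediction.lower().split()
    r_tokens = reference.lower().split()
    common = sum(min(p_tokens.count(t), r_tokens.count(t))
                 for t in set(p_tokens))
    if common == 0:
        return 0.0
    precision = common / len(p_tokens)
    recall = common / len(r_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Vagus Nerve", "the vagus nerve"))        # 1
print(round(token_f1("the vagus nerve", "vagus nerve"), 2))     # 0.8
```

EM is all-or-nothing, which is why an average near 52% EM still leaves room for partially correct answers that F1 rewards.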
The competition (who):
- Proprietary models (e.g., Gemini-2.5-Pro, Claude Sonnet, GPT-5.2) and strong open-source lines (Qwen, DeepSeek, Llama/Mistral/Ministral, etc.).
The scoreboard (with context):
- T2S-Bench-MR average EM ≈ 52%. That's like a class average near a C, meaning the tasks are hard.
- Best MR performer: Gemini-2.5-Pro at about 81.4% EM and 91.56% F1 (an A range), well above most peers.
- End-to-end structuring: Node extraction is the bottleneck. Even the best models hover near 58% node accuracy; link F1 is higher (mid-80s for leaders). Translation: finding the right "who" is hard; connecting known "whos" is easier.
- By reasoning type: Boundary Testing and Counterfactual Reasoning are relatively easier for top models; Fault Localization is toughest (tracing exact causal chains is tricky). Functional Mapping sits in the middle.
SoT vs. CoT vs. Direct:
- SoT consistently beats both direct answering and CoT on text-heavy tasks across multiple models and datasets.
- Example: On 2WikiMultiHopQA and MuSiQue, SoT gains can exceed 10 percentage points, where CoT often adds little or can add noise.
Training helps beyond the benchmark:
- Fine-tuning Qwen2.5-7B and LLaMA-3.1-8B on T2S-Train-1.2k boosts T2S-Bench and also improves external long-context tasks (e.g., HotpotQA, GovReport, QMSum), often by 5ā10 points depending on metric.
- Correlation: Better T2S-Bench MR scores align with better LongBench Pro performance. Structural skills transfer to general long-context reasoning.
Surprising findings:
- Node vs. link gap: Models are much better at linking than at correctly identifying the set of nodes, suggesting entity detection and discourse segmentation are key pain points.
- Complexity cliff: As reference graphs get bigger (more nodes), link performance drops sharply for many models; robustness to structural complexity is still limited.
🍞 Anchor: It's like building a family tree from a long story. Most students can draw lines between known relatives (links), but many forget to include some cousins (nodes). The test shows who remembered all the people and who just drew neat lines.
05 Discussion & Limitations
🍞 Hook: If you can tidy your room, you find toys faster. But if your labels are wrong or your boxes are missing, tidying won't help much. Structure helps, when it's done right.
🥬 The Concept: The work shows that explicit structure boosts performance, but it also reveals current limits and practical needs.
Limitations:
- Node extraction is hard: Even top models miss key entities, capping overall gains.
- Domain shape: The data comes from scientific papers; general web or narrative texts may require adaptation.
- One-to-many structures: Multiple valid maps exist; the benchmark reduces this via partial constraints but can't cover all variations.
- Long, complex graphs: Performance degrades as structures grow; scalability is an open challenge.
Required resources:
- Compute for fine-tuning and evaluating many models.
- Access to high-quality texts with structural cues (e.g., scientific literature) and human validation time.
- Careful prompt design and inference-time formatting to get clean JSON structures.
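That last point, getting a clean JSON structure back from a model, usually needs defensive parsing in practice. A minimal hypothetical sketch; the regex-based extraction below is our assumption, not the paper's tooling:

```python
import json
import re

def extract_structure(model_output):
    """Pull the first JSON object out of a model's free-form reply.
    A defensive-parsing sketch (our assumption, not the paper's
    pipeline): find a {...} span, then try to decode it."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

reply = 'Here is the map: {"nodes": ["A", "B"], "links": [["A", "B"]]} Answer: B'
structure = extract_structure(reply)
print(structure["nodes"])  # ['A', 'B']
```

Returning None on malformed output lets the caller retry or fall back to direct answering instead of crashing mid-evaluation.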
When NOT to use:
- Very short, simple tasks where structure adds overhead and no benefit.
- Texts with highly ambiguous or poetic connections where node/link extraction becomes subjective and noisy.
- Real-time, ultra-low-latency settings that canāt afford the extra structuring step.
Open questions:
- Can we unify node extraction with stronger entity linking and coreference to close the node-link gap?
- How to generalize beyond scientific prose to messy web text without losing precision?
- Can compositional strategies (e.g., sampling several candidate structures and aggregating) push accuracy further?
- How to scale to very large graphs without performance collapse: hierarchies, chunking, or memory-augmented methods?
- What safety and privacy practices best prevent misuse of large-scale text-to-structure extraction?
🍞 Anchor: Think of upgrading a school library. Labels (nodes) must be right before shelves (links) make sense. The study shows the plan works; now we need better labels, bigger shelves, and good rules for safe use.
06 Conclusion & Future Work
🍞 Hook: You know how drawing a quick outline before writing makes everything smoother? That's the whole story here: models that outline first, answer better.
🥬 The Concept: Three-sentence summary:
- The paper proposes Structure of Thought (SoT), a prompt that makes models build a node-link map of the text before answering, and introduces T2S-Bench to fairly test this skill.
- Across 45 models, SoT boosts results on many text tasks, while T2S-Bench shows big headroom, especially in node extraction: structure is powerful but still challenging.
- Fine-tuning on T2S-Train further improves both in-benchmark and out-of-benchmark performance, and better T2S scores correlate with better long-context reasoning elsewhere.
Main achievement:
- Establishing explicit text structure as a universal intermediate representation and providing the first comprehensive benchmark and training set to measure and improve it.
Future directions:
- Stronger node discovery (entity, coreference, discourse) to close the biggest gap.
- Scalable structuring for large graphs (hierarchical or chunked approaches).
- Broader domains (messy web text, narratives) with robust, fair scoring.
- Hybrid inference (sample multiple structures, aggregate) and combinations with other reasoning methods.
Why remember this:
- Turning text into a simple, shared map makes complex reasoning clearer, more accurate, and more inspectable. Just like students who outline first, models that structure first think better and explain better.
🍞 Anchor: Next time you face a long chapter, try this SoT trick yourself: list the main ideas (nodes), draw arrows (links), and then answer the questions. That's exactly how this paper teaches AI to read smarter.
Practical Applications
- Literature review assistants that first map studies into cause-effect graphs, then write clear, evidence-grounded summaries.
- Customer support bots that structure long troubleshooting guides into decision trees before suggesting fixes.
- Compliance and policy checkers that extract agency roles and control links from regulations to flag gaps or conflicts.
- Healthcare helpers that structure clinical guidelines into pathways to answer multi-step patient care questions.
- Business report generators that turn earnings calls and filings into node-link maps before drafting insights.
- Educational tutors that outline textbook chapters (concept maps) and walk students through multi-hop questions.
- Scientific discovery tools that convert methods and results into structured pipelines to compare and reproduce findings.
- Risk analysis dashboards that extract actors, assets, and dependencies from documents to simulate counterfactuals.
- RAG systems that build intermediate tables/graphs from retrieved text to improve multi-document QA.
- Project planners that map tasks, owners, and dependencies from specs before producing timelines and status reports.