
From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Intermediate
Zongzhao Li, Xiangzhe Kong, Jiahui Su et al. Ā· 12/11/2025
arXiv Ā· PDF

Key Summary

  • The paper defines Microscopic Spatial Intelligence (MiSI) as the skill AI needs to understand tiny 3D things like molecules from 2D pictures and text, just like scientists do.
  • It builds a big test called MiSI-Bench with 163k questions and 588k images from about 4,000 protein–ligand structures to check if Vision-Language Models (VLMs) can do this.
  • Nine tasks start simple (move, turn, zoom) and grow harder (find hydrogen bonds, dock a ligand back into a pocket), mirroring what real scientists do.
  • Current top VLMs perform far below human level on these microscopic tasks, especially on multi-step 3D transformations and science-grounded bonding questions.
  • Humans do well on basic tasks but struggle when transformations stack up (like two rotations in a row) or when zoom levels make scale cues tricky.
  • A small 7B open model, after fine-tuning on MiSI-Bench, becomes excellent at spatial transformations and even beats humans on rotation-heavy tasks.
  • However, even the fine-tuned model is weak at scientific interaction recognition (like hydrogen bonds), showing domain knowledge is missing.
  • The benchmark uses clean 2D orthographic views (front/left/top) and precise text templates to make the 3D reasoning measurable and fair.
  • Results suggest that adding explicit biochemical knowledge and better 3D spatial training could move VLMs much closer to useful scientific assistants.
  • MiSI-Bench is released publicly, so others can train and test models toward true scientific AI.

Why This Research Matters

Medicines, green materials, and enzymes all depend on correctly understanding how molecules fit and interact in 3D. MiSI-Bench shows today’s general AI isn’t yet ready for that microscopic world and pinpoints exactly where it struggles. With the right training data, even small models can learn strong 3D spatial habits, which could speed up early-stage drug and material design. But to be truly helpful, AI also needs explicit scientific knowledge about bonds and interactions, not just geometry. This benchmark provides a public, reproducible way to measure progress toward that goal. In short, it’s a roadmap for turning visual chatbots into trustworthy scientific assistants.

Detailed Explanation


01Background & Problem Definition

šŸž You know how you can look at a Lego model from the front, the side, and the top, and your brain can imagine its 3D shape? Scientists do the same thing with molecules, except the pieces are atoms you can’t see with your eyes.

🄬 The Concept: Spatial Intelligence (macro world)

  • What it is: The ability to understand where things are in space and how they move or fit together.
  • How it works: 1) Notice objects; 2) Track positions and angles; 3) Predict results of moves and rotations; 4) Use rules (like left/right, up/down) to reason.
  • Why it matters: Without it, robots and AIs can’t reliably navigate rooms, pack boxes, or assemble parts. šŸž Anchor: A robot stacking blocks needs spatial intelligence to keep the tower from falling.

šŸž Imagine shrinking from building blocks down to atoms. The game changes: tiny shapes, tight fits, and invisible forces.

🄬 The Concept: Microscopic Spatial Intelligence (MiSI)

  • What it is: The skill of understanding the 3D relationships of atoms and molecules from 2D images and text, and reasoning about their interactions.
  • How it works: 1) View molecules from orthographic angles; 2) Reconstruct 3D in your mind; 3) Apply moves/rotations/zoom; 4) Check scientific relations like hydrogen bonds.
  • Why it matters: Without MiSI, AI can’t help design drugs, understand protein pockets, or reason about chemical interactions. šŸž Anchor: A drug fits into a protein pocket like a key into a lock; MiSI helps AI see and reason about that fit.

šŸž Think of a tour guide who can look at a picture and also read a description, then explain both together.

🄬 The Concept: Vision-Language Models (VLMs)

  • What it is: AI systems that learn from images and text jointly, so they can connect what they see to what they read.
  • How it works: 1) Encode images; 2) Encode text; 3) Fuse the two; 4) Generate answers grounded in both.
  • Why it matters: Without VLMs, AI might see shapes but not understand their names or functions, or read words but not link them to visuals. šŸž Anchor: When you ask about a picture of Paris, a VLM can spot the Eiffel Tower and also explain its history.
  1. The World Before: AI got pretty good at everyday, big-world (macroscopic) spatial tasks like understanding room layouts or recognizing objects. But molecules are different: you must read 2D views carefully to reconstruct 3D, and then apply strict chemical rules.

  2. The Problem: No standard way existed to measure whether VLMs can do atom-level 3D thinking and relationship spotting (like hydrogen bonds). Without a fair test, it’s hard to train or trust AI for science.

  3. Failed Attempts: Existing benchmarks focus on visible, macroscopic scenes or use pure 3D coordinates with specialized models (e.g., equivariant GNNs). These don’t check whether multimodal AIs can do the scientist-like dance of seeing molecular views and reasoning in natural language.

  4. The Gap: We need a benchmark that looks like how scientists actually work—orthographic 2D images plus precise language—measuring both spatial moves (translation, rotation, zooming) and relational science (like bonds) from the same visuals.

šŸž Imagine blueprint drawings for a machine, plus written instructions. You need both to understand and build it. That’s the gap this paper fills for molecules.

  5. Real Stakes: Medicines, clean materials, and enzymes we rely on all come from understanding molecules’ 3D shapes and interactions. If AI can truly master MiSI, it could help scientists discover treatments faster, reduce trial-and-error, and explain its reasoning clearly to humans.

02Core Idea

šŸž Imagine teaching a kid to juggle: start with one ball, then two, then three. If they can’t juggle one, they won’t manage three.

🄬 The Concept: The ā€œAha!ā€

  • What it is: Break microscopic spatial thinking into simple building blocks (move, turn, zoom, and identify interactions) and test VLMs on these blocks alone and in combinations using 2D views—just like scientists do.
  • How it works: 1) Show front/left/top images; 2) Ask for the specific move/turn/zoom or interaction; 3) Combine steps into multi-stage challenges; 4) Score precisely.
  • Why it matters: Without decomposing the skills, we can’t see where models fail or how to train them up. šŸž Anchor: It’s like a piano exam with scales (basics) and full songs (composites) to measure true skill.

Multiple Analogies for the Same Idea:

  • Recipe Analogy: Learn to chop (translation), stir (rotation), adjust heat (zoom), then cook a full dish (dock a ligand).
  • Map Analogy: Slide the map (translation), rotate it to face north (rotation), zoom for detail (zooming), then find where two roads connect (hydrogen bond).
  • Sports Analogy: Practice dribbling, passing, and shooting separately; then play a full game combining them.

šŸž The Concept: Orthographic Projection

  • What it is: Showing a 3D object in flat, exact views (front/left/top) without perspective distortion.
  • How it works: 1) Choose fixed camera directions; 2) Project atoms straight onto the 2D plane; 3) Keep scale consistent; 4) Use multiple views to infer depth.
  • Why it matters: Without clean views, it’s hard to tell how far or in which direction to move or rotate—errors pile up. šŸž Anchor: Think of a box drawn as front, side, and top blueprints you can assemble in your head.
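
To make the idea concrete, here is a tiny Python sketch of orthographic projection: each view simply drops one coordinate axis, so there is no perspective math at all. The axis-to-view mapping below is an illustrative assumption, not the benchmark’s exact camera convention.

```python
import numpy as np

def orthographic_views(coords: np.ndarray) -> dict:
    """Project Nx3 atom coordinates onto three axis-aligned view planes.

    Orthographic projection drops the depth axis for each view, so there is
    no perspective scaling and distances stay comparable across views.
    The axis-to-view mapping here is an illustrative assumption.
    """
    x, y, z = coords[:, 0], coords[:, 1], coords[:, 2]
    return {
        "front": np.stack([x, y], axis=1),  # camera looking along -z (assumed)
        "left":  np.stack([z, y], axis=1),  # camera looking along +x (assumed)
        "top":   np.stack([x, z], axis=1),  # camera looking along -y (assumed)
    }

# Toy "molecule" with three atoms (coordinates in Angstroms)
atoms = np.array([[0.0, 0.0, 0.0],
                  [1.5, 0.0, 0.0],
                  [1.5, 1.2, 0.8]])
views = orthographic_views(atoms)
print(views["front"])  # depth (z) is discarded; x/y positions stay exact
```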

Before vs After:

  • Before: VLMs looked good on everyday images but we didn’t know if they could think at atom-scale.
  • After: MiSI-Bench shows they struggle at microscopic 3D tasks, but a small model improves dramatically after fine-tuning—especially on rotations—while still missing domain knowledge for bonds.

Why It Works (intuition, not equations):

  • Decompose-and-recombine exposes exactly which micro-skill is missing.
  • Multi-view images remove perspective tricks, forcing true geometric reasoning.
  • Structured prompts (cloze/MC) and decoy options make mistakes diagnosable (wrong axis, wrong sign, wrong magnitude).
  • Weighted scoring gives credit when a model gets the axis right but the angle slightly off—more informative than all-or-nothing.

Building Blocks (introduced with mini-sandwiches):

  • šŸž Translation: Like sliding a book on a table. It’s moving along x or y. 🄬 What/How/Why: Move in a straight line; measure distance and sign; without it, models can’t align views. šŸž Anchor: ā€œmove x 4ā€ shifts the complex to the right.
  • šŸž Rotation: Like turning a steering wheel. It’s spinning around x, y, or z. 🄬 What/How/Why: Pick an axis; set direction/angle by the right-hand rule; without it, orientation matching fails. šŸž Anchor: ā€œroll y 15ā€ tips the scene left/right.
  • šŸž Zooming: Like moving a camera closer or farther. 🄬 What/How/Why: Change depth along z; pick the right amount; without scale control, details vanish or overwhelm. šŸž Anchor: ā€œmove z 50ā€ zooms in to inspect a pocket.
  • šŸž Relational Identification: Like noticing two magnets click together. 🄬 What/How/Why: Spot specific atom–atom patterns (e.g., hydrogen bonds); confirm distances/angles; without it, science grounding is missing. šŸž Anchor: ā€œARG NH2, O2ā€ means the residue’s donor bonds to the ligand’s oxygen.
  • šŸž Unit Tasks vs Composite Tasks: 🄬 What/How/Why: Units test one skill; composites chain skills (e.g., rotate then translate). Without both, we can’t measure real, multi-step reasoning. šŸž Anchor: Learn each Lego move, then build the castle.

03Methodology

At a high level: Molecules (PDBbind) → Multi-view images + bond annotations (ChimeraX) → Question templates (unit + composite tasks) → Train/fine-tune VLMs → Evaluate with accuracy and weighted scores.

šŸž Imagine photographing a toy from three fixed sides, then asking a friend to tell you how to move or turn it to match another photo.

🄬 The Concept: Dataset Construction Pipeline

  • What it is: A repeatable recipe to turn 3D protein–ligand complexes into 2D views plus precise Q&A labels.
  • How it works:
    1. Collect structures from PDBbind; keep the pocket and ligand; color atoms by type; label residues and ligand atoms.
    2. Use ChimeraX to render orthographic images (front/left/top and full 6-view sets when needed) and compute hydrogen bonds; also record on-screen coordinates of atoms.
    3. Fill fixed text templates to create question–answer pairs for each task.
  • Why it matters: Without consistent views and labels, models can’t be fairly trained or tested. šŸž Anchor: Like making flashcards from a textbook: the same layout for every card, different content per topic.
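
A hedged sketch of the question-generation step: sample a transformation from the paper’s bins and fill a cloze template with the answer string. The template wording and the choice to exclude zero offsets are illustrative assumptions; the ChimeraX rendering and hydrogen-bond detection steps are not reproduced here.

```python
import random

# Hypothetical sketch of template filling only. Rendering the views and
# detecting hydrogen bonds (done with ChimeraX in the paper) are omitted.

CLOZE = ("Images 1-3 are the initial front/left/top views; Image 4 is the "
         "final front view. Fill in the blank operation: {blank}")

def translation_qa(rng: random.Random) -> dict:
    axis = rng.choice(["x", "y"])
    dist = rng.choice([d for d in range(-4, 5) if d != 0])        # 1-Angstrom bins
    return {"question": CLOZE.format(blank="move __ __"),
            "answer": f"move {axis} {dist}"}

def rotation_qa(rng: random.Random) -> dict:
    axis = rng.choice(["x", "y", "z"])
    angle = rng.choice([a for a in range(-90, 91, 15) if a != 0])  # 15-degree bins
    return {"question": CLOZE.format(blank="roll __ __"),
            "answer": f"roll {axis} {angle}"}

def zoom_qa(rng: random.Random) -> dict:
    dist = rng.randint(40, 60)                                     # readable scales
    return {"question": CLOZE.format(blank="move z __"),
            "answer": f"move z {dist}"}

rng = random.Random(0)
for make in (translation_qa, rotation_qa, zoom_qa):
    print(make(rng)["answer"])   # e.g. "move y 3", "roll x -30", "move z 47"
```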

Step-by-step details and examples:

  1. Inputs and Views
  • Inputs: ~4,000 protein–ligand complexes; final splits: 3,503 for training, 490 for testing; total 163,514 QAs and 587,975 images.
  • Views: Orthographic front/left/top for most tasks; all six views (front/left/top/back/right/bottom) for interaction tasks to reduce occlusion confusion.
  • Example: For a rotation task, you see the initial 3 views and one final front view; your job is to say ā€œroll y 15.ā€
  • Why this step exists: Multiple fixed views enable precise 3D inference. Without them, depth and axis signs are guessy.
  2. Unit Tasks (single skill)
  • Translation (cloze):
    ā—¦ What happens: The complex shifts along x or y by āˆ’4 to 4 ƅ (binned by 1 ƅ). You fill in ā€œmove x 3.ā€
    ā—¦ Why it exists: Tests direction and magnitude understanding on the screen plane.
    ā—¦ Example: If Image 4 is 4 ƅ to the right: ā€œmove x 4.ā€
  • Rotation (cloze):
    ā—¦ What happens: The complex rotates around x/y/z by āˆ’90° to 90° (binned by 15°). You fill in ā€œroll y āˆ’30.ā€
    ā—¦ Why it exists: Checks axis choice and angle estimation.
    ā—¦ Example: A tip that raises the top and lowers the bottom likely means a rotation around x.
  • Zooming (cloze):
    ā—¦ What happens: The complex moves along z by 40–60 ƅ (binned by 1 ƅ), which corresponds to readable molecular scales.
    ā—¦ Why it exists: Teaches scale control; too small or too large destroys detail.
    ā—¦ Example: ā€œmove z 50ā€ zooms in to center interactions.
  • Residue–Ligand Interaction (cloze):
    ā—¦ What happens: Given six views of a residue and the ligand, you answer ā€œNoā€ or list all hydrogen bonds like ā€œARG NH2, O2G.ā€
    ā—¦ Why it exists: Tests domain-grounded relational identification.
    ā—¦ Example: ā€œYes: VAL O, N1; VAL O, N2.ā€
  3. Composite Tasks (multi-step)
  • Translation + Rotation (MCQ):
    ā—¦ What happens: From a reference complex, infer a rotation then a translation; apply the same to a new complex; pick the correct result (A–D).
    ā—¦ Why it exists: Chaining steps detects error accumulation and frame-of-reference slips.
    ā—¦ Example: If the reference does ā€œroll z 30; move x āˆ’3,ā€ choose the option showing the target with the same net change.
  • Rotation + Rotation (MCQ):
    ā—¦ What happens: Two rotations around different axes; apply them in the same order; select the correct result.
    ā—¦ Why it exists: Rotations don’t commute, so order matters; this stresses working memory.
    ā—¦ Example: ā€œroll x 30ā€ then ā€œroll y āˆ’45ā€ is not the same as the reverse order.
  • Interaction Location (MCQ):
    ā—¦ What happens: Given a known hydrogen-bond pair, pick the translation that centers it.
    ā—¦ Why it exists: Connects relational understanding to spatial control.
    ā—¦ Example: ā€œmove x 7, move y 5ā€ centers the bond.
  • Ligand Docking (cloze):
    ā—¦ What happens: Pocket views plus displaced ligand views plus a combined front view; predict the rotation(s) and then the translation needed to re-dock, e.g., ā€œroll x 75, move x āˆ’14.ā€
    ā—¦ Why it exists: Emulates real docking logic: align orientation, then place.
    ā—¦ Example: If the ligand is tilted and shifted right, first undo the tilt (rotation), then slide it left.
  • Pocket–Ligand Interaction (cloze):
    ā—¦ What happens: From six views, list all hydrogen bonds between the whole pocket and the ligand, with residue numbers and chain IDs.
    ā—¦ Why it exists: Tests global relational reasoning across many candidates.
    ā—¦ Example: ā€œASN 460 ND2 A, O2; GLU 537 OE1 A, O2.ā€
  4. Clever Decoys and Scoring (the ā€œsecret sauceā€)
  • Decoys: Crafted by tweaking magnitudes, flipping signs, or changing axes, so we can tell whether a model’s mistake is in the angle, the direction, or the axis.
  • Metrics:
    ā—¦ Accuracy for MCQ and zooming tasks.
    ā—¦ Weighted composite scores for cloze tasks: partial credit for the right axis with a near-right magnitude; equal weighting per sub-operation in multi-step answers; penalties for listing too many or hallucinated bonds.
  • Why it matters: Nuanced scoring reveals partial understanding and avoids rewarding guessy enumerations. (A minimal scoring sketch follows the anchor below.)

šŸž Anchor: It’s like a driving test with exact cones to weave through and point deductions that show whether you drifted left or misjudged distance—even if you didn’t fully fail the course.

04Experiments & Results

šŸž Picture a science quiz where students, teachers, and robots all try the same questions. Who really understands the tiny 3D world?

🄬 The Concept: The Test Design

  • What it is: Compare many strong VLMs, a fine-tuned small model, and humans on MiSI-Bench (and a tiny subset for expensive models).
  • How it works: 1) Few-shot prompts for models; 2) Human PhD participants for tough tasks; 3) Use accuracy and weighted scores; 4) Report unit vs composite results.
  • Why it matters: Without apples-to-apples comparisons, we can’t tell progress from hype. šŸž Anchor: It’s like a track meet where sprinters, marathoners, and students all run the same measured distances.
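
For intuition about the evaluation setup, here is a rough sketch of how a few-shot prompt for one MCQ item might be assembled as a chat-style message list. The wording, image layout, file names, and message schema are all illustrative assumptions, not the authors’ exact protocol.

```python
# Hypothetical prompt assembly: k solved examples, then the unanswered test item.

def build_fewshot_prompt(examples: list, test: dict) -> list:
    """Return a chat-style message list for one evaluation item."""
    messages = [{"role": "system",
                 "content": "Answer spatial questions about molecular views."}]
    for ex in examples:
        messages.append({"role": "user",
                         "content": {"images": ex["images"], "text": ex["question"]}})
        messages.append({"role": "assistant", "content": ex["answer"]})
    messages.append({"role": "user",
                     "content": {"images": test["images"], "text": test["question"]}})
    return messages

question = ("Images 1-3: initial views. Image 4: final front view. "
            "Which option (A-D) shows the same rotation then translation?")
demo = [{"images": ["ref_view1.png", "ref_view2.png", "ref_view3.png", "ref_final.png"],
         "question": question, "answer": "C"}]
test = {"images": ["q_view1.png", "q_view2.png", "q_view3.png", "q_final.png"],
        "question": question}

prompt = build_fewshot_prompt(demo, test)
print(len(prompt), prompt[-1]["role"])   # 4 messages; the last one is the open 'user' turn
```
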
  1. What They Measured and Why
  • Accuracy on multiple-choice tasks and zooming: checks if models can discriminate correct spatial outcomes.
  • Weighted scores on cloze tasks: grants partial credit when axis is right but angle is a bit off, which is more diagnostic than exact-match only.
  • Special penalties against ā€œlist everythingā€ answers in bond finding to discourage guessing.
  2. Who Competed
  • Closed-source ā€œreasoningā€ VLMs from major labs; advanced open-source baselines; and a small open 7B model fine-tuned on MiSI-Bench.
  • Humans: STEM PhDs (for spatial tasks) and biology PhDs (for bond/docking tasks) as a practical upper bound.
  3. The Scoreboard (with context)
  • Most advanced VLMs performed far below human level across MiSI-Bench, especially on composite spatial transformations (two-step rotations or rotation+translation), often near random-guess accuracy.
  • Distance-based skills (translation, centering a bond) were generally easier than rotation-based skills (choosing the right axis and angle), mirroring human intuitions about 2D-trained vision.
  • Humans did very well on unit tasks but struggled when transformations stacked: errors accumulated and reference frames shifted, causing big drops on two-rotation problems and challenging zoom scales.
  • The fine-tuned 7B model showed striking gains: near-perfect performance on unit spatial transformations and accuracy that surpasses humans on rotation-heavy composites, evidence that domain-targeted fine-tuning can unlock strong 3D spatial habits in VLMs.
  • However, the same model lagged on scientifically grounded tasks (hydrogen bond recognition at residue or pocket scale), indicating that spatial skill alone isn’t enough—you need explicit biochemical knowledge.
  4. Surprising Findings
  • A small fine-tuned model can outplay giant general models on microscopic spatial tasks, suggesting domain adaptation—not just model size—matters a lot.
  • Humans, though experts, found consecutive rotations unintuitive; keeping track of axes and order is hard even for us.
  • Models were better at saying ā€œno bondā€ (when atoms are far) than listing all true bonds; precise relational chemistry remains difficult.
  • Error analysis showed bond detection accuracy drops as the number of bonds rises, and zooming errors peak at visually ambiguous scales—pointing to attention bottlenecks when cues are sparse.
  5. What This Means
  • Spatial understanding learned from everyday images doesn’t transfer cleanly to the molecular world.
  • Fine-tuning on the right micro-skills quickly amplifies a model’s 3D reasoning—but scientific knowledge must be injected to master interactions like hydrogen bonds. šŸž Anchor: Think of a gamer who learns the map layout (spatial skills) fast but still needs rulebooks (science) to win strategy matches.
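
Since hydrogen bonds keep coming up, here is a generic geometric check (donor–acceptor distance plus D–H···A angle) often used as a textbook heuristic. MiSI-Bench derives its bond labels with ChimeraX on the 3D structures before rendering, and its exact criteria differ; the thresholds below are common defaults, not the paper’s.

```python
import numpy as np

def is_hydrogen_bond(donor, hydrogen, acceptor,
                     max_da_dist: float = 3.5, min_dha_angle: float = 120.0) -> bool:
    """Generic geometric hydrogen-bond test: donor-acceptor distance plus
    D-H...A angle. Thresholds are common textbook defaults; the benchmark's
    ChimeraX-based detection uses its own, more detailed criteria."""
    d, h, a = map(np.asarray, (donor, hydrogen, acceptor))
    da = np.linalg.norm(a - d)
    hd, ha = d - h, a - h
    cos_ang = np.dot(hd, ha) / (np.linalg.norm(hd) * np.linalg.norm(ha))
    angle = np.degrees(np.arccos(np.clip(cos_ang, -1.0, 1.0)))
    return bool(da <= max_da_dist and angle >= min_dha_angle)

# Toy geometry: nearly linear D-H...A arrangement, ~2.9 Angstrom donor-acceptor gap
print(is_hydrogen_bond([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.9, 0.1, 0.0]))  # True
# Atoms too far apart: no bond
print(is_hydrogen_bond([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]))  # False
```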

05Discussion & Limitations

šŸž Imagine a student who’s great at geometry but hasn’t taken chemistry yet. They can rotate shapes perfectly, but can’t tell which atoms like to bond.

🄬 The Concept: Honest Assessment

  • What it is: A clear look at limits, resources, failure cases, and open questions.
  • How it works: 1) Name where it struggles; 2) Note what it needs; 3) Say when not to use it; 4) Pose the next big puzzles.
  • Why it matters: Without this, we overtrust models in high-stakes science. šŸž Anchor: Before using a ladder, you check if it’s tall enough and on steady ground.

Limitations

  • Interaction scope: Focuses on hydrogen bonds; other forces (hydrophobic, salt bridges, π–π) aren’t covered yet.
  • Visual simplifications: Hydrogens hidden, solvents removed; real biophysical contexts are richer and noisier.
  • 2D orthographic only: Great for clarity, but true 3D (volumetric data, flexible proteins) isn’t directly tested.
  • Dataset domain: PDBbind pockets and ligands may not represent all biomolecular diversity (membrane proteins, RNA, metals, etc.).
  • Metrics: Weighted scoring is informative but still approximates how close a 3D state is to ā€œchemically right.ā€

Required Resources

  • Tools: ChimeraX for rendering and bond detection; scripting to generate views and QA templates.
  • Compute: GPUs for fine-tuning VLMs on hundreds of thousands of images/QAs; storage for ~600k images.
  • Expertise: Chemistry/structural biology knowledge to extend interactions or validate tricky cases.

When NOT to Use

  • Predicting binding affinity or kinetics (thermodynamics/MD are outside scope).
  • Highly flexible proteins or induced-fit docking without additional modeling.
  • Electron density maps or cryo-EM volumes (not 2D-rendered here).
  • Non-protein systems (e.g., materials, RNA complexes) unless adapted.

Open Questions

  • Knowledge injection: What’s the best way to teach biochemistry—pretraining on curated structural corpora, tool-augmented reasoning, or symbolic chemistry modules?
  • 3D fidelity: Can we couple 2D views with explicit 3D constraints so rotations/placements respect molecular physics?
  • Generalization: Will MiSI training transfer to other microscopic domains (materials, nanostructures)?
  • Robustness: How to handle occlusions, noise, or different rendering styles without losing spatial accuracy?
  • Evaluation: Can we score ā€œchemical correctnessā€ directly (e.g., energy plausibility or clash checks) beyond geometric matching? šŸž Anchor: It’s like moving from learning chess piece moves (spatial) to mastering openings and endgames (domain knowledge).

06Conclusion & Future Work

šŸž Picture teaching AI to be a junior scientist: first practice moving and turning molecular shapes, then learn which atoms like to hold hands.

  1. Three-Sentence Summary
  • This paper defines Microscopic Spatial Intelligence (MiSI) and introduces MiSI-Bench, a nine-task benchmark built from ~4,000 protein–ligand structures with 2D orthographic views and structured Q&A.
  • State-of-the-art VLMs perform far below human level on microscopic spatial reasoning, especially on composite transformations and chemistry-grounded interactions.
  • A fine-tuned 7B VLM becomes excellent at spatial transformations—even surpassing humans on rotation-heavy tasks—yet still needs explicit scientific knowledge to recognize interactions like hydrogen bonds.
  2. Main Achievement
  • A clear, scalable framework that decomposes microscopic spatial reasoning into measurable unit skills and realistic composite tasks, revealing exactly where today’s models fail and how fine-tuning helps.
  3. Future Directions
  • Inject biochemical knowledge during pretraining; pair 2D views with 3D constraints; broaden interactions beyond hydrogen bonds; and test generalization to other micro-worlds like materials.
  4. Why Remember This
  • MiSI-Bench shifts the focus from ā€œseeingā€ to ā€œscientifically perceiving,ā€ showing that with the right practice set, even small models can gain strong 3D spatial sense—but true scientific AI will demand both spatial skill and domain wisdom. šŸž Anchor: Like learning to read music and to play by ear—you need both to become a great musician; MiSI-Bench teaches AI the notes of molecular space.

Practical Applications

  • Create training curricula that first teach VLMs translation, rotation, and zooming on molecules, then advance to docking and interaction discovery.
  • Use Interaction Location tasks to build tools that auto-center and snapshot specific bonds for lab reports or teaching.
  • Integrate MiSI-style fine-tuning into molecular visualization software (e.g., assistants that suggest the next best view or move).
  • Develop AI lab helpers that propose candidate docking transformations before physics-based refinement.
  • Construct classroom exercises where students learn hydrogen bonds with multi-view images and immediate AI feedback.
  • Benchmark new multimodal models for pharma teams to decide if they are reliable for pocket analysis and pose inspection.
  • Pre-screen molecular datasets by flagging inconsistent poses or missing annotations using a MiSI-trained validator.
  • Guide foundation model pretraining by including orthographic molecular views and chemistry text to seed domain priors.
  • Prototype interactive tutorials where a VLM explains why a certain axis/angle is correct, improving scientists’ spatial intuition.
  • Automate figure generation for papers: correct view alignment, centered interactions, and verified bond lists.
Tags: microscopic spatial intelligence, vision-language models, orthographic projection, protein–ligand docking, hydrogen bond recognition, spatial transformation reasoning, molecular benchmark, ChimeraX, PDBbind, cloze questions, multiple-choice decoys, supervised fine-tuning, 3D molecular understanding, domain knowledge integration, scientific AGI