
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Intermediate
Christopher Clark, Jieyu Zhang, Zixian Ma et al. · 1/15/2026
arXiv · PDF

Key Summary

  • Molmo2 is a family of vision-language models that can watch videos, understand them, and point to or track things over time using fully open weights, data, and code.
  • It introduces nine new open datasets (human and synthetic) for dense video captioning, multi-image QA, video QA, pointing, and tracking—without distilling from closed models.
  • A special training recipe (message-tree encoding, efficient packing, bi-directional attention, and token-weighting) makes learning from long videos and dense outputs feasible.
  • Molmo2 can answer questions and also “show its work” by clicking the exact place and time in the video, or by tracking objects across frames.
  • On video grounding, Molmo2-8B beats strong proprietary and open models (e.g., 38.4 F1 on video pointing vs 20.0 for Gemini 3 Pro; and 35.5 vs 29.6 accuracy for video counting over Qwen3-VL).
  • On short-video understanding, captioning, and counting, Molmo2 sets a new bar among open models and is competitive with proprietary systems.
  • Token-weighting balances long captions and short answers so the model doesn’t forget how to do quick questions.
  • Message-tree encoding and on-the-fly sequence packing deliver up to 15x training efficiency without wasting context tokens.
  • Molmo2 supports single images, sets of images, and videos, outputting both text and grounded coordinates/tracks in a compact HTML-like format.
  • Everything (weights, data, and training code) is released to empower open research and reproducibility.

Why This Research Matters

Videos are how the world communicates today, from school lessons and sports to safety cameras and robot eyes. Molmo2 doesn’t just talk about videos—it shows exactly where and when things happen, making answers checkable and trustworthy. Because its weights, data, and code are fully open, anyone can study, improve, and repurpose it without hidden dependencies. This helps teachers, creators, and engineers build tools that jump to the right moment in long videos or follow important objects automatically. It also enables more transparent AI for safety and accessibility, like pointing out hazards or highlighting steps for learners. By proving that open models can reach high levels of grounded video understanding, Molmo2 raises the floor for the entire community.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine watching a sports game with a friend who not only tells you what’s happening but can also pause the video, point to the exact player, and follow them the whole match. That’s more than talking—it’s seeing, pointing, and tracking.

🥬 Concept: Natural Language Processing (NLP)

  • What it is: NLP is how computers understand and use human language.
  • How it works: 1) Read words, 2) figure out grammar and meaning, 3) respond in clear sentences.
  • Why it matters: Without NLP, the model could see but not explain. 🍞 Anchor: When you ask, “Who scored first?”, NLP helps the model answer with words you understand.

🥬 Concept: Video Understanding

  • What it is: Teaching computers to make sense of moving pictures over time.
  • How it works: 1) Split video into frames, 2) notice objects and actions, 3) connect events across time.
  • Why it matters: Without it, the model treats a video like a pile of random photos. 🍞 Anchor: Recognizing that “the red car passes the blue car, then crashes” needs video understanding.

🥬 Concept: Spatial Localization

  • What it is: Knowing where things are in an image.
  • How it works: 1) Spot items, 2) mark their positions (x, y), 3) keep the coordinates consistent.
  • Why it matters: Without location, the model can’t point to anything. 🍞 Anchor: “Point to the dog’s nose” requires correct coordinates.

🥬 Concept: Temporal Reasoning

  • What it is: Understanding when things happen and in what order.
  • How it works: 1) Attach time to frames, 2) link earlier and later events, 3) answer questions like “before/after/when”.
  • Why it matters: Without time, the model can’t tell when a goal was scored or when a cup fell. 🍞 Anchor: “When does the player celebrate?” needs a timestamp.

🥬 Concept: Object Detection

  • What it is: Finding and naming things in pictures.
  • How it works: 1) Scan the image, 2) match shapes to known objects, 3) output identities and positions.
  • Why it matters: Without this, the model can’t find the ball or the player. 🍞 Anchor: “Find the yellow car” starts with detecting the car.

🥬 Concept: Vision-Language Models (VLMs)

  • What it is: VLMs connect vision (images/videos) and language (text) in one brain.
  • How it works: 1) Turn images into tokens, 2) feed tokens and words to a language model, 3) produce answers or descriptions.
  • Why it matters: Without a VLM, we would have separate systems that don’t truly talk to each other. 🍞 Anchor: A VLM answers “What is the boy holding?” after looking at the video.

🥬 Concept: Grounding

  • What it is: Making answers tie back to exact pixels and times.
  • How it works: 1) Add coordinates (x, y) and timestamps (t), 2) link them to words, 3) show points/tracks alongside text.
  • Why it matters: Without grounding, models give vague answers you can’t verify. 🍞 Anchor: “Click where the cup falls” ties the text to the exact spot and moment.

🥬 Concept: Dense Video Captioning

  • What it is: Writing long, detailed stories about everything happening in a video.
  • How it works: 1) Describe each short clip, 2) merge details, 3) ensure time and objects are consistent.
  • Why it matters: Without dense captions, models miss small but important events. 🍞 Anchor: A 15-minute DIY video summarized with all steps, tools, and mistakes.

The world before: Video models were strong but mostly closed-source, so people couldn’t see the training data or copy the recipe. Open models often relied on text made by closed models (distillation), which hid biases and limited progress. Most importantly, many models could answer high-level questions but couldn’t ground those answers in pixels and time.

The problem: Everyday needs—like searching a long lecture, finding exactly when a player crossed a line, or counting how many times a robot grabbed a block—require pointing to exact places and times, not just talking. But even some proprietary models struggled with this.

Failed attempts: Prior open datasets were small, narrow, or generated by closed systems. Training on long videos was slow and memory-hungry. Models often over-learned from long captions and got worse at short answers. And many training pipelines wasted context space with padding.

The gap: The community needed fully open data for video grounding and a training method that handles long, dense outputs without forgetting quick Q&A.

Molmo2’s answer: Release new, large, fully open datasets (no closed-model distillation) for dense video captioning, free-form and long-video QA, pointing, and tracking—plus a training recipe with message-tree encoding, efficient sequence packing, bi-directional attention, and token-weighting so the model balances short and long outputs.

Real stakes: This matters for assistive tools (show me where a step happens), robotics (track the part to pick), education (jump to the exact moment a math trick is shown), safety (spot and follow hazards), and content creation (find the frame where a glitch appears). When models can both say and show, people trust them more.

02 Core Idea

🍞 Hook: You know how great teachers don’t just explain an answer—they point at the board and show exactly where it comes from? That mix of talking and pointing makes learning click.

🥬 Concept: The Aha! Insight

  • What it is: Teach a model to both describe and precisely locate things across time using only open data, plus a training recipe that makes long, dense learning efficient and balanced.
  • How it works: 1) Build open datasets for dense captions, QA, pointing, and tracking, 2) train in stages, 3) use message-trees and packing to fit more into context, 4) let visual tokens attend bi-directionally, 5) weight long/short outputs smartly.
  • Why it matters: Without this combo, open models either talk without showing, or learn from hidden data, or forget short answers. 🍞 Anchor: Molmo2 can say “The No. 11 car passes first” and also click the exact frame and spot.

Three analogies for the idea:

  1. Tour guide: Instead of only describing a museum, the guide also shines a laser pointer on each painting and follows a moving spotlight down the hall.
  2. Cookbook + timer: The recipe (caption) explains steps, and the timer (grounding) tells you exactly when to flip the pancake and where in the pan to look.
  3. Detective’s notebook: Notes describe events, pins on a map mark places, and a timeline ties it all together.

🥬 Concept: Attention Mechanism → Bi-directional Attention

  • What it is: Attention helps the model focus on the most relevant bits; bi-directional attention lets visual tokens look at other visual tokens freely across frames.
  • How it works: 1) Score relevance among tokens, 2) allow frames/images to attend forward to each other, 3) fuse details across time.
  • Why it matters: Without bi-directional attention, the model can’t easily link clues across frames. 🍞 Anchor: To answer “Which dancer moved left to right?”, frames need to reference each other.

🥬 Concept: Message-Tree Encoding

  • What it is: A way to organize multiple Q&As, captions, and annotations for the same media as branches of one tree.
  • How it works: 1) Put the video/image once at the root, 2) branch into separate annotations, 3) mask attention so branches don’t tangle.
  • Why it matters: Without it, examples cross-talk and confuse the model; training becomes slow and messy. 🍞 Anchor: One video with branches for “caption,” “pointing,” and “counting” learned side-by-side safely.

🥬 Concept: Token-Weight Strategy

  • What it is: A rule to keep long outputs (like 900-word captions) from overwhelming training compared to short answers (like “3”).
  • How it works: 1) Down-weight very long outputs, 2) up-balance short ones based on sqrt(length), 3) keep skills balanced.
  • Why it matters: Without it, the model becomes great at essays but bad at quick questions. 🍞 Anchor: It’s like making sure a class doesn’t spend the whole week on one super-long assignment.

🥬 Concept: Sequence Packing

  • What it is: A way to pack many short examples together so context isn’t wasted on padding.
  • How it works: 1) Keep a pool of preprocessed examples, 2) solve a small puzzle to fill context exactly, 3) add custom attention masks so tasks don’t leak.
  • Why it matters: Without packing, training wastes space and time—especially with videos. 🍞 Anchor: Like fitting different Tetris pieces so there are no gaps.

Before vs After:

  • Before: Open models often learned from closed-model outputs, struggled with grounding, and wasted compute on padding; long captions drowned out short answers.
  • After: Molmo2 learns from fully open data, points and tracks objects/events, fits more into context with packing and message-trees, and stays balanced with token weighting.

Why it works (intuition, not equations):

  • Dense, human-grounded data teaches details; pointing and tracking force the model to tie words to pixels and time; bi-directional attention links frames; token weighting prevents long texts from dominating; and packing/message-trees supercharge data efficiency. Together, these make an open model that both talks and shows.

Building blocks (the recipe’s pieces):

  • Open datasets: dense captions, AskModelAnything QA, long-video QA, video pointing, video tracking, multi-image QA/pointing.
  • Architecture: ViT visual encoder + connector + LLM.
  • Training stages: image pre-train → joint SFT → long-context SFT.
  • Efficiency tools: message-tree encoding + on-the-fly packing.
  • Modeling tweaks: bi-directional attention, time tokens, token weighting.
  • Grounded outputs: compact HTML-like strings for points and tracks (x, y, time, object ID).

03 Methodology

At a high level: Input (image(s) or video) → Vision Encoder (ViT) → Connector (pool + project) → LLM with text/timestamps → Output (free-form text + grounded points/tracks).

🥬 Concept: Vision Encoder (ViT)

  • What it is: The model that turns pixels into tokens the LLM can understand.
  • How it works: 1) Split image/frame into patches, 2) embed patches, 3) pass through transformer layers.
  • Why it matters: Without a good encoder, details like small text or tiny objects get lost. 🍞 Anchor: Reading jersey numbers or tiny road signs needs crisp visual tokens.
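
To make the patch-token idea concrete, here is a minimal, hypothetical ViT front end in PyTorch. The patch size, channel count, and hidden dimension are illustrative assumptions, not Molmo2’s actual encoder settings.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Turn an image or frame into a grid of patch tokens (standard ViT front end;
    patch size and dimensions here are illustrative, not Molmo2's exact encoder)."""
    def __init__(self, patch: int = 14, in_ch: int = 3, dim: int = 1024):
        super().__init__()
        # A strided convolution splits the image into non-overlapping patches and embeds each.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: [batch, 3, H, W] -> [batch, (H/patch)*(W/patch), dim]
        x = self.proj(img)
        return x.flatten(2).transpose(1, 2)

tokens = PatchEmbed()(torch.randn(1, 3, 336, 336))
print(tokens.shape)  # torch.Size([1, 576, 1024])
```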

🥬 Concept: Connector Module

  • What it is: A bridge that pools and projects visual features so they align with the LLM.
  • How it works: 1) Pool nearby patches (2×2 for images, 3×3 for video frames), 2) share parameters for image and video, 3) project into the LLM’s space.
  • Why it matters: Without a connector, the LLM and vision encoder “speak different languages.” 🍞 Anchor: Like an adapter that lets your laptop plug into a different country’s outlet.
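
Below is a minimal sketch of what such a connector could look like in PyTorch. It uses simple average pooling with the 2×2 and 3×3 windows mentioned above and a shared linear projection; the pooling operator, names, and dimensions are assumptions, not Molmo2’s exact design.

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Pools ViT patch features and projects them into the LLM embedding space.
    A sketch: pooling windows follow the text (2x2 images, 3x3 video frames);
    average pooling and the hidden sizes are illustrative assumptions."""
    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vit_dim, llm_dim)  # shared for images and video

    def forward(self, patches: torch.Tensor, grid: int, pool: int) -> torch.Tensor:
        # patches: [batch, grid*grid, vit_dim] patch features from the ViT
        b, n, d = patches.shape
        x = patches.view(b, grid, grid, d).permute(0, 3, 1, 2)  # [b, d, grid, grid]
        x = nn.functional.avg_pool2d(x, kernel_size=pool)       # spatial pooling
        x = x.flatten(2).transpose(1, 2)                        # [b, tokens, d]
        return self.proj(x)                                     # align with LLM space

# Usage: 2x2 pooling for an image tile, 3x3 for a video frame (same weights).
connector = Connector()
image_tokens = connector(torch.randn(1, 24 * 24, 1024), grid=24, pool=2)  # -> [1, 144, 4096]
video_tokens = connector(torch.randn(1, 24 * 24, 1024), grid=24, pool=3)  # -> [1, 64, 4096]
```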

Input handling (images and videos):

  • Images: Multi-crop tiling during training (K=8) and more at test time (K=24) to see high resolution.
  • Videos: Sample frames at 2 fps; up to 128 frames for SFT, 384 for long-context. The last frame is always included, since video players often stop on it and users frequently reference it.
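
As a rough illustration of this sampling rule, the hypothetical helper below picks 2 fps timestamps, caps them at a frame budget, and force-includes the final frame. The function name and the uniform subsampling step are assumptions for illustration.

```python
def sample_frame_times(duration_s: float, fps: float = 2.0, max_frames: int = 128) -> list[float]:
    """Pick frame timestamps at a fixed rate, capped at max_frames, always keeping
    the last frame (a sketch of the sampling rule described above)."""
    times = [t / fps for t in range(int(duration_s * fps) + 1)]
    if len(times) > max_frames:
        # Uniformly subsample down to the frame budget.
        step = (len(times) - 1) / (max_frames - 1)
        times = [times[round(i * step)] for i in range(max_frames)]
    if times[-1] < duration_s:
        times[-1] = duration_s  # force-include the final frame
    return times

# A 10-minute video at 2 fps would need 1200 frames; the cap keeps 128, ending at 600.0s.
print(len(sample_frame_times(600.0)), sample_frame_times(600.0)[-1])
```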

🥬 Concept: Time Tokens (timestamps)

  • What it is: Special text-like tokens marking when each frame occurs.
  • How it works: 1) Insert “0.0s, 0.5s, …” in the token stream, 2) let the LLM align events to time.
  • Why it matters: Without timestamps, the model guesses “when,” weakening temporal answers and captions. 🍞 Anchor: Answering “When does the cup fall?” requires knowing the clock.
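
A tiny sketch of how timestamp markers could be interleaved with per-frame visual tokens; the `<frame>` placeholder and the exact "0.0s" formatting are illustrative, not Molmo2’s literal token format.

```python
def build_video_prompt(frame_times: list[float], frame_token: str = "<frame>") -> str:
    """Interleave timestamp markers with per-frame visual tokens so the LLM can
    align events to the clock (format is illustrative, not the exact Molmo2 one)."""
    parts = []
    for t in frame_times:
        parts.append(f"{t:.1f}s")   # time token, e.g. "0.0s", "0.5s"
        parts.append(frame_token)   # placeholder replaced by that frame's visual tokens
    return " ".join(parts)

print(build_video_prompt([0.0, 0.5, 1.0]))
# -> "0.0s <frame> 0.5s <frame> 1.0s <frame>"
```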

🥬 Concept: Bi-directional Attention for Visual Tokens

  • What it is: Allowing visual tokens to attend to other visual tokens across frames/images.
  • How it works: 1) Configure masks so visual tokens can look forward, 2) prevent leakage between different examples/branches.
  • Why it matters: Without it, connecting actions across frames is harder. 🍞 Anchor: Following a dancer from left group to right group needs cross-frame attention.
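
The sketch below shows one way to build such a mask: text tokens stay causal while visual tokens may attend to each other in both directions. It omits the extra masking across packed examples and message-tree branches that the full recipe also needs.

```python
import torch

def build_attention_mask(is_visual: torch.Tensor) -> torch.Tensor:
    """Causal mask for text tokens, but let visual tokens attend to every other
    visual token, including those in later frames. `is_visual` is a [seq] bool tensor.
    A sketch of the idea, not the full Molmo2 masking scheme."""
    seq = is_visual.shape[0]
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))  # standard causal mask
    visual_pair = is_visual[:, None] & is_visual[None, :]        # both query and key are visual
    return causal | visual_pair                                  # allow forward looks among visual tokens

# Example: [text, vis, vis, text, vis] — visual positions see each other in both directions.
mask = build_attention_mask(torch.tensor([False, True, True, False, True]))
print(mask.int())
```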

Training in three stages (recipe):

  1. Pre-training (image focus):
    • Mix: 60% dense captions, 30% image pointing, 10% language-only.
    • Goal: Build strong captioning and grounding habits early.
    • Why: Without pointing early, later grounding is less stable.
  2. Joint SFT (images + videos + multi-image):
    • Mix categories with tuned sampling rates (e.g., upsample video pointing/tracking, downsample giant synthetic sets).
    • Max sequence length 16,384; batch size 128.
    • Why: Without balanced mixing, the model overfits to data-rich tasks and forgets others.
  3. Long-context SFT:
    • Increase context to 36,864 tokens and 384 frames; use context parallelism.
    • Why: Without this, long-video QA suffers.
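
For a quick side-by-side view, here is an illustrative config-style summary of the three stages using the numbers quoted above; the field names and structure are assumptions, not the real Molmo2 configuration schema.

```python
# Illustrative summary of the three-stage recipe described above (values from the
# text; field names are assumptions, not Molmo2's actual config schema).
TRAINING_STAGES = {
    "pretrain": {
        "data_mix": {"dense_captions": 0.60, "image_pointing": 0.30, "language_only": 0.10},
        "modalities": ["image", "text"],
    },
    "joint_sft": {
        "max_seq_len": 16_384,
        "batch_size": 128,
        "max_video_frames": 128,
        "notes": "upsample video pointing/tracking, downsample large synthetic sets",
    },
    "long_context_sft": {
        "max_seq_len": 36_864,
        "max_video_frames": 384,
        "parallelism": "context-parallel attention",
    },
}
```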

🥬 Concept: Token-Weight Strategy (training balance)

  • What it is: A way to keep very long outputs from dominating the loss.
  • How it works: 1) Set small fixed weights for very long captions/pointing, 2) weight others by sqrt(answer length).
  • Why it matters: Without it, quick QA skills degrade. 🍞 Anchor: The model stays good at both essays and short quizzes.
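
A minimal sketch of this weighting rule, assuming a per-example weight of sqrt(answer length) plus a small fixed weight once an answer passes a length threshold; the threshold and constant are illustrative, not the paper’s exact values.

```python
import math

def answer_weight(num_answer_tokens: int,
                  long_threshold: int = 512,
                  long_weight: float = 8.0) -> float:
    """Per-example loss weight: sqrt(length) for ordinary answers so short replies
    still matter, and a small fixed weight for very long outputs so dense captions
    don't dominate. Threshold and constant here are illustrative assumptions."""
    if num_answer_tokens > long_threshold:
        return long_weight
    return math.sqrt(num_answer_tokens)

# A 4-token answer ("3") and a ~900-word caption end up within ~4x of each other
# instead of ~300x when weighted purely by token count.
print(answer_weight(4), answer_weight(1200))
```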

🥬 Concept: Message-Tree Encoding (multi-annotation organization)

  • What it is: Put one media input at the root, branch out per annotation.
  • How it works: 1) Linearize the tree, 2) apply custom masks so branches don’t talk to each other, 3) handle multi-turn dialogues.
  • Why it matters: Without it, annotations collide, hurting learning. 🍞 Anchor: A single video with separate branches for “caption,” “count,” “pointing,” and “tracking.”
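
Here is a toy sketch of the encoding idea: one shared media prefix at the root, annotation branches appended after it, and an attention rule that lets each branch see the root but not its siblings. Class and function names are made up for illustration, and real Molmo2 trees also handle multi-turn dialogue.

```python
from dataclasses import dataclass, field

@dataclass
class MessageTree:
    """One media input at the root, independent annotation branches below it."""
    media_tokens: list[str]
    branches: list[list[str]] = field(default_factory=list)

def linearize(tree: MessageTree) -> tuple[list[str], list[int]]:
    """Flatten root + branches into one sequence and record each token's branch id
    (-1 = shared root)."""
    tokens, branch_ids = list(tree.media_tokens), [-1] * len(tree.media_tokens)
    for b, branch in enumerate(tree.branches):
        tokens.extend(branch)
        branch_ids.extend([b] * len(branch))
    return tokens, branch_ids

def can_attend(branch_ids: list[int], q: int, k: int) -> bool:
    """Query token q may look at key token k only if k is in the shared root or in q's branch."""
    return branch_ids[k] == -1 or branch_ids[k] == branch_ids[q]

tree = MessageTree(media_tokens=["<frame0>", "<frame1>"],
                   branches=[["Q: caption?", "A: ..."], ["Q: count?", "A: 3"]])
tokens, ids = linearize(tree)
print(can_attend(ids, q=4, k=0), can_attend(ids, q=4, k=2))  # True (root), False (other branch)
```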

🥬 Concept: Sequence Packing (efficiency)

  • What it is: On-the-fly packing of multiple short examples into one long sequence.
  • How it works: 1) Keep a small pool, 2) solve a knapsack-like puzzle to fill tokens and crop slots, 3) emit one packed batch.
  • Why it matters: Without packing, we waste tokens on padding. With it, Molmo2 got ~15× efficiency. 🍞 Anchor: Like perfectly filling a suitcase to avoid wasted space.
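
A simplified, greedy first-fit version of this packing puzzle is sketched below; the real pipeline solves a knapsack-like problem over both token and crop budgets, so treat this as an illustration of the idea rather than the actual algorithm.

```python
def pack_examples(example_lengths: list[int], context_len: int) -> list[list[int]]:
    """Greedy first-fit packing of examples into fixed-size contexts so little
    space is left as padding. Returns lists of example indices per packed sequence."""
    bins: list[tuple[int, list[int]]] = []  # (tokens used, example indices)
    for idx, length in sorted(enumerate(example_lengths), key=lambda p: -p[1]):
        for i, (used, members) in enumerate(bins):
            if used + length <= context_len:
                bins[i] = (used + length, members + [idx])
                break
        else:
            bins.append((length, [idx]))    # start a new packed sequence
    return [members for _, members in bins]

lengths = [9000, 6000, 5000, 4000, 2000, 1500, 300]
print(pack_examples(lengths, context_len=16_384))
# -> [[0, 1, 6], [2, 3, 4, 5]]: two nearly full contexts instead of seven padded ones.
```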

Grounded outputs:

  • Compact HTML-like strings for points/tracks: time or image index, object ID, and x/y (0–1000 normalized). Sorted by time, then by x/y.
  • Counting via points: Reused object IDs make counts obvious (highest ID = count).
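
To show how point strings and counting-by-ID fit together, here is a hypothetical example in the spirit of that format; the tag and attribute names are assumptions, not Molmo2’s actual output syntax.

```python
import re

# Hypothetical compact point string in the spirit of the format described above
# (tag and attribute names are assumptions, not Molmo2's actual output syntax).
example_output = (
    '<point t="2.5" id="1" x="310" y="540"/>'
    '<point t="2.5" id="2" x="620" y="455"/>'
    '<point t="4.0" id="2" x="700" y="430"/>'
)

def parse_points(s: str) -> list[dict]:
    """Extract (time, object id, x, y), with x/y on the 0-1000 normalized grid."""
    pattern = r'<point t="([\d.]+)" id="(\d+)" x="(\d+)" y="(\d+)"/>'
    return [{"t": float(t), "id": int(i), "x": int(x), "y": int(y)}
            for t, i, x, y in re.findall(pattern, s)]

points = parse_points(example_output)
count = max(p["id"] for p in points)  # reused object IDs make the count the highest ID
print(count)                          # -> 2
```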

🥬 Concept: Temporal Grounding

  • What it is: Clicking the exact moment and place where something happens.
  • How it works: 1) Choose the frame with the event, 2) click the location, 3) optionally track it forward/backward.
  • Why it matters: Without it, answers can’t be verified in time and space. 🍞 Anchor: “Click when the ball crosses the line” is temporal grounding in action.

Data pipelines (fully open):

  • Dense video captions (Molmo2-Cap): spoken narrations (transcribed) + frame details merged into long, detailed captions (avg ~924 words/video).
  • Human QA (AskModelAnything) for videos and multi-images: fine-grained questions and refined answers (no closed-model distillation).
  • Synthetic QA (CapQA, SubtitleQA): built from Molmo2’s own captioner plus transcripts.
  • Video pointing/tracking: human clicking frames/points; curated academic tracks converted to points; tracking queries crafted to be non-trivial and often multi-object.

Secret sauce summary:

  • Open data for grounding, efficient training (message-trees + packing), balanced learning (token weights), and stronger modeling (bi-directional attention + time tokens).

04 Experiments & Results

🍞 Hook: Think of a science fair where everyone solves the same challenges—short quizzes, long essays, and lab demos. Now imagine you also have to point to the exact place in a video where your answer came from.

The tests and why they matter:

  • Short-video understanding (e.g., MVBench, MotionBench): checks quick reasoning about actions and scenes.
  • Long-video understanding (e.g., LongVideoBench, MLVU): tests memory across long durations.
  • Captioning (Molmo2-CapTest): measures how many correct statements the model makes in long descriptions.
  • Counting and pointing (Molmo2-VideoCount, BURST-VC, Molmo2-VideoPointVal): tests whether the model can “show its work” by clicking the right place/time and getting near-accurate counts.
  • Tracking (MeViS, Ref-YT-VOS, Ref-DAVIS, ReasonVOS, Molmo2-Track): measures following objects across frames using J&F (segmentation quality), F1 (point accuracy), and HOTA (association correctness).
  • Human preference: judges pick which answers they prefer across open-ended questions.

The competition:

  • Proprietary APIs: GPT-5, Gemini 3 Pro/2.5.
  • Open-weight baselines: Qwen3-VL, InternVL3.5, GLM-4.1V, etc.
  • Specialized trackers/segmenters (for tracking tasks): VideoLISA, VideoGLaMM, Sa2VA, SAM-based pipelines.

Scoreboard highlights (with context):

  • Video pointing (Molmo2-VP F1): Molmo2-8B scores 38.4 F1 vs Gemini 3 Pro’s 20.0. That’s like doubling the number of correct clicks on a tough find-the-moment test.
  • Video counting (BURST and the Molmo2 count sets): Molmo2-8B hits 35.5 vs 29.6 for Qwen3-VL on a core counting metric—roughly moving from a B to a solid A- when others are around B-.
  • Tracking: On challenging tracking benchmarks (e.g., ReasonVOS, Molmo2-Track), Molmo2 models top open-weight baselines and even outperform specialized segmentation pipelines in aggregate metrics.
  • Short-video understanding and captioning: Molmo2-8B is state-of-the-art among open models on several short-video tasks, captioning, and counting, approaching some proprietary results.
  • Long-video understanding: Competitive but still behind the very best proprietary and some open-weight systems that focus heavily on ultra-long context; long-context SFT boosts performance significantly on long-video QA.
  • Human preference: Annotators rate Molmo2 equal or better than leading open-weight models and even ahead of some proprietary models, indicating well-rounded, helpful answers.

Surprising findings:

  • Point-then-count beats count-only: Asking the model to click all instances first, then count, works better than just predicting a number. Grounding sharpens reasoning.
  • Bi-directional attention and time tokens matter: Removing them hurts QA and captioning, showing temporal linking is crucial.
  • Token weighting trades a tiny bit of caption F1 for much stronger balance on short answers—worth it overall.
  • “VF” (merged video+frame captions) is critical for dense captioning; adding raw transcripts alone helps less.

🥬 Concept: Long-Context Training (post-training stage)

  • What it is: A short final stage with much longer token budgets and more frames.
  • How it works: 1) Expand the context to 36,864 tokens and 384 frames, 2) use context-parallel attention to fit in memory, 3) fine-tune on the same data mixture.
  • Why it matters: Without it, performance on ultra-long videos lags notably. 🍞 Anchor: After this step, Molmo2 remembers more of a movie-length video when answering questions.

Bottom line:

  • If your task is short videos, counting, captioning, or grounding, Molmo2 sets a new bar for fully open models and challenges some proprietary systems.
  • For very long videos (10+ minutes) and ultra-long reasoning, Molmo2 improves with long-context SFT but still has room to grow, partly due to limited open long-video data and compute for extended training.

05 Discussion & Limitations

🍞 Hook: Even the best Swiss Army knife has tools it doesn’t include—and you still need a steady hand to use it well.

Limitations (be specific):

  • Ultra-long videos: Performance on very long (10+ minute) videos is behind top proprietary or specialized open systems; open long-video data remains scarce.
  • OCR-heavy and expert reasoning: On document/OCR-intense or deep multi-modal reasoning (e.g., MathVista, MMMU), Molmo2 trails some best-in-class open-weight competitors.
  • Resource costs: Training and serving long videos (hundreds of frames) require significant GPU memory and careful batching/packing; naive use can be slow/expensive.
  • Mask/box outputs: Molmo2 natively outputs points/tracks; converting to segmentation masks relies on external tools (e.g., SAM 2) for some evaluations.

Required resources:

  • GPUs with sufficient memory for 16k–36k token contexts and many frames (context parallelism recommended for long-context use).
  • Efficient data pipelines (packing, message-trees) and tuned sampling to avoid skew.
  • Clean video frame extraction (e.g., 2 fps + last frame) and optional subtitles.

When NOT to use:

  • Tasks needing pixel-perfect segmentation out-of-the-box (use specialized segmenters or pair with SAM).
  • Ultra-long, hours-long surveillance analysis where specialized streaming/temporal compression models dominate.
  • Pure OCR/document workflows where text extraction is the main goal (dedicated OCR/VLMs may perform better).

Open questions:

  • Can we build larger, fully open long-video datasets (10–60+ minutes) to close the long-context gap?
  • How far can grounding-first training push reasoning accuracy (e.g., chain-of-thought with grounded steps)?
  • What’s the best balance between frame rate, crop count, and token budget for mixed video types?
  • Can token-weighting become adaptive per-instance (e.g., curriculum that shifts with model confidence)?
  • How to standardize grounded outputs across models (points, boxes, masks, tracks) for easier interoperability?

🍞 Anchor: Think of Molmo2 as an open, high-precision flashlight for videos: it shines bright on where and when things happen, but for some super-long caves or specialized tunnels, you may still want extra gear.

06 Conclusion & Future Work

Three-sentence summary: Molmo2 shows that fully open models can not only describe videos but also point to exact pixels and moments, and even track objects across time. It achieves this with nine new open datasets and a training recipe that balances long and short outputs while packing many annotations efficiently. The result is state-of-the-art open performance on short videos, captioning, counting, and strong video grounding, with competitive results on longer videos.

Main achievement: Bringing precise, point-driven grounding and tracking to open video-language models—using only open weights, open data (no closed distillation), and open code—while delivering a practical, efficient training pipeline (message-trees, packing, token-weighting, bi-directional attention).

Future directions:

  • Grow open long-video datasets and extend long-context training to narrow the gap on 10+ minute videos.
  • Add native mask/box generation alongside points/tracks for richer grounding.
  • Explore grounded chain-of-thought where every reasoning step is localized in space-time.
  • Improve OCR/document-heavy reasoning and multi-modal math with targeted open data.

Why remember this: Molmo2 raises the bar for what open models can do—talk and show, explain and point—making video AI more trustworthy, checkable, and useful for everyone from students and creators to engineers and robots.

Practical Applications

  • Sports analytics: Automatically point to offside moments, track key players, and count passes or shots.
  • Education: Jump to the exact time a math technique is demonstrated and highlight the board area used.
  • Video editing: Find and mark visual glitches or artifacts across frames for quick cleanup.
  • Robotics: Track target objects for grasping and verify each successful pick with grounded clicks.
  • Customer support/how-to: Index tutorial videos with grounded steps so users can click to the exact moment.
  • Security/traffic monitoring: Count vehicles, track unusual motion, and highlight the exact frames of events.
  • Accessibility: Provide grounded descriptions that point to where actions occur for low-vision users.
  • Scientific experiments: Track cells/particles over time and count occurrences with precise timestamps.
  • Quality assurance in manufacturing: Detect, count, and track defects on assembly lines with evidence points.
  • Content search: Ask natural questions over large video libraries and jump to grounded answers instantly.
Tags: vision-language model, video grounding, pointing and tracking, dense video captioning, bi-directional attention, token weighting, message-tree encoding, sequence packing, long-context training, open weights, open data, video QA, object tracking, temporal reasoning, spatio-temporal localization