LoopViT: Scaling Visual ARC with Looped Transformers
Key Summary
- Loop-ViT is a vision model that thinks in loops, so it can take more steps on hard puzzles and stop early on easy ones.
- It reuses the same block of layers again and again (weight tying), which lets it reason deeper without needing more parameters.
- A Hybrid Block mixes global attention (to learn rules) with local convolution (to make pixel-precise edits), matching how ARC puzzles work.
- A parameter-free Dynamic Exit watches predictive entropy (uncertainty) and halts when the answer has "crystallized."
- On ARC-AGI-1, Loop-ViT Large (18M params) hits 65.8% Pass@2, beating a 73M-parameter feed-forward ensemble.
- Even the 3.8M-parameter Small model scores 60.1%, surpassing an 18M standard ViT baseline (54.5%).
- Scaling "time" (iterations) is a better path for abstract visual reasoning than just scaling "space" (parameters).
- Across loops, predictions stabilize and attention shifts from broad exploration to focused execution, showing step-by-step deliberation.
- Dynamic Exit improves the accuracy–compute trade-off by spending compute only when needed.
- Loop-ViT shows that adaptive, iterative computation is a powerful new axis for scaling visual reasoning.
Why This Research Matters
Many real tasks need trial-and-error refinement, not a single shot. Loop-ViT shows how to give a small vision model more thinking time when problems are hard and less when they’re easy. That saves energy, reduces latency on simple inputs, and yields better accuracy on tricky ones—an ideal fit for devices with limited resources. The approach also makes the model’s behavior more understandable: we can see predictions stabilize and attention narrow as it converges. By proving that time (iterations) can replace size (parameters), Loop-ViT opens a practical path to smarter, greener, and more reliable vision systems. This shift can influence everything from robotics to medical imaging where confidence and careful refinement really matter.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re solving a maze with a pencil. Sometimes you find the exit quickly; other times you try, erase, try again, and slowly zero in on the right path. Your brain doesn’t always use the same number of steps—it adapts.
🥬 The Concept (Neural Networks, Vision Transformers, ARC, and Loops—why this research exists):
- What it is: Loop-ViT is a vision model that reasons in multiple steps, like you trying different paths in a maze, stopping when it’s sure.
- How it works (big picture): Instead of using a tall, one-pass tower of layers (feed-forward), Loop-ViT reuses the same smart block over and over (a loop). It mixes two skills: global attention (to discover the rule) and local convolution (to carefully edit pixels). It also watches its own confidence and halts early when the answer stabilizes.
- Why it matters: Many visual reasoning puzzles, like those in ARC-AGI, require multi-step thinking. A single pass often isn’t enough. Without loops, models either grow huge (costly) or fail on multi-step rules.
🍞 Anchor: Think of a tiny robot painter that can scan the whole grid to guess the rule and then make small brushstrokes each loop to fix the picture. It keeps painting until the picture matches its plan, then puts the brush down.
— New concept — 🍞 Hook: You know how a camera takes a picture all at once, while a detective solves a case step by step? Some problems need that step-by-step style.
🥬 The Concept (Neural Networks):
- What it is: A neural network is a stack of math layers that turn inputs (like images) into outputs (like labels or new images) by learning from examples.
- How it works:
- Look at the input pixels.
- Pass them through layers that extract patterns.
- Adjust weights during training to reduce mistakes.
- Use the trained weights to make predictions.
- Why it matters: It’s the basic machine that powers most modern AI; everything else in this paper builds on it.
🍞 Anchor: Like a kid learning to read: practice (training) tunes the “weights,” and then they can read new stories (inference).
— New concept — 🍞 Hook: Picture cutting an image into tiles and rearranging them so you can scan relationships between all tiles at once.
🥬 The Concept (Vision Transformers, ViTs):
- What it is: A ViT turns an image into a set of patches (tokens) and uses attention to let every patch talk to every other patch.
- How it works:
- Split image into patches; turn each into a token.
- Use self-attention to compare and combine information across all tokens.
- Stack layers to refine a global understanding.
- Output a prediction (like a class or an edited image).
- Why it matters: Attention captures long-range relationships, which are crucial for discovering rules in puzzles.
🍞 Anchor: It’s like a classroom discussion where every student (patch) can hear every other student before the class votes on the answer.
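To make the patch-and-attend idea concrete, here is a minimal PyTorch sketch (purely illustrative, not the paper's code; the image size, patch size, and embedding width are arbitrary choices):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 32, 32)                              # (batch, channels, height, width)
patch = 8
tiles = img.unfold(2, patch, patch).unfold(3, patch, patch)  # cut into a 4x4 grid of 8x8 tiles
tiles = tiles.reshape(1, 3, -1, patch * patch)               # (1, 3, 16, 64)
tokens = tiles.permute(0, 2, 1, 3).flatten(2)                # (1, 16, 192): one flat vector per tile

embed = nn.Linear(3 * patch * patch, 64)                     # project each tile to a 64-d token
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = embed(tokens)
out, weights = attn(x, x, x)   # every patch "hears" every other patch before the output is formed
```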
— New concept — 🍞 Hook: Think of colorful pixel puzzles where a few examples show a rule (like “mirror the red shape”), and you must apply it to a new grid.
🥬 The Concept (ARC-AGI benchmark):
- What it is: ARC-AGI is a set of tiny grid puzzles where you infer a rule from 2–4 examples and transform a new input grid to the correct output.
- How it works:
- See a few input→output pairs (demonstrations).
- Guess the hidden rule (e.g., fill holes, move objects, symmetry).
- Apply it to the test grid.
- Why it matters: ARC measures abstract, step-by-step reasoning rather than surface pattern memorization.
🍞 Anchor: It’s like being shown how to fold two paper cranes and then asked to fold a third by following the same hidden steps.
The world before: Early ARC solvers often converted the 2D grids into text for language models or into 1D tokens for recurrent models. While that used powerful language priors, it threw away the image’s spatial layout, which is vital for pixel-accurate edits. The Vision ARC (VARC) line proved you can solve ARC directly from pixels with Vision Transformers—no text needed. But as puzzles got more complex, feed-forward ViTs hit a wall: just making them wider or deeper brought diminishing returns because the number of steps required to solve a puzzle varies. A single fixed-depth pass can’t flex.
The gap: We needed a way to separate “how big the model is” from “how long it can think.” Humans do variable-step thinking; the model should too.
The problem: How can a vision model do many steps without exploding in size and still know when to stop?
The solution in this paper: Loop-ViT. It reuses the same block (weight tying) to take multiple reasoning steps and watches its own uncertainty (predictive entropy) to decide when to halt. This brings better accuracy with far fewer parameters and smarter use of compute.
Real stakes: This idea helps any task where answers need iterative refinement—like cleaning up a segmentation mask, following physics-like rules in grid worlds, or snapping shapes in design tools. It means smaller models can act smarter by taking more (or fewer) steps as needed.
02 Core Idea
🍞 Hook: You know how a chef keeps tasting a soup, adding a little salt, tasting again, and stopping the moment it tastes right? That’s smarter than dumping in a pile of salt once.
🥬 The Concept (The Aha!):
- What it is: Let the vision model reuse the same reasoning block in a loop, so it can take as many steps as needed and stop when confident, instead of being forced into one fixed pass.
- How it works:
- Replace a tall stack with one reusable “thought step” (weight-tied block).
- Iterate this step T times, refining the internal state and prediction.
- Mix global attention (learn the rule) with local convolution (apply the rule precisely).
- Measure uncertainty (predictive entropy) each step; stop when it’s low.
- Why it matters: This decouples reasoning time from model size, making small models reason deeply and save compute on easy cases.
🍞 Anchor: Like a careful editor who re-reads a paragraph, fixes a few words, re-reads again, and sends it to print exactly when it’s polished—not before, not after.
— New concept — 🍞 Hook: Imagine a weather app that not only says “70% chance of rain” but also how sure it is about that number.
🥬 The Concept (Predictive Entropy):
- What it is: A number that measures how uncertain a prediction is; high entropy means “I’m unsure,” low entropy means “I’m confident.”
- How it works:
- Look at the predicted probabilities across choices (like colors per pixel).
- Compute average uncertainty (entropy) over the grid.
- Track how this uncertainty changes across loops.
- Why it matters: If uncertainty has dropped and stabilized, the model likely reached a correct, stable answer—so it can stop.
🍞 Anchor: If your coin flip forecast says 50/50, that’s high entropy (unsure). If it says 99/1, that’s low entropy (confident). Low entropy = time to stop flipping.
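A minimal sketch of how predictive entropy can be computed from per-pixel color scores (the function name and tensor layout are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def mean_predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Average per-pixel Shannon entropy for a grid of color predictions.

    logits: (H, W, C) unnormalized scores over C colors for each pixel.
    High result = "I'm unsure"; low result = "I'm confident".
    """
    probs = F.softmax(logits, dim=-1)                       # per-pixel probability distributions
    entropy = -(probs * torch.log(probs + 1e-12)).sum(-1)   # Shannon entropy per pixel
    return entropy.mean()                                   # average over the whole grid
```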
— New concept — 🍞 Hook: Think of practicing a piano passage: instead of learning 20 different fingerings, you perfect one fingering and repeat it until the piece sounds right.
🥬 The Concept (Looped Transformers):
- What it is: A Transformer that reuses the same block over multiple steps to refine its reasoning.
- How it works:
- Start with an initial state from the image and task examples.
- Apply the same block to update the state.
- Repeat, letting each pass correct and sharpen the last.
- After each step, check confidence; continue or halt.
- Why it matters: It builds algorithm-like thinking, not just one-shot pattern matching, without growing parameter count.
🍞 Anchor: Like playing a turn-based game where each turn improves your position; you stop when you win.
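Here is a minimal sketch of the difference between reusing one block (scaling time) and stacking many blocks (scaling space); a generic PyTorch encoder layer stands in for the paper's Hybrid Block:

```python
import torch
import torch.nn as nn

dim, T = 64, 12
block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)  # ONE shared block

z = torch.randn(1, 100, dim)        # initial state, e.g. 100 tokens for a 10x10 grid
for t in range(T):                  # reuse the SAME weights T times: "scaling time"
    z = block(z)                    # each pass refines the state left by the previous one

# A feed-forward ViT would instead stack T *different* layers: "scaling space",
# which costs roughly T times the parameters for the same number of steps.
stack = nn.Sequential(*[nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
                        for _ in range(T)])
```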
— New concept — 🍞 Hook: Picture a team: one member plans (big-picture), another does careful touch-ups (details). Together, they finish faster and better.
🥬 The Concept (Weight-Tied Hybrid Block):
- What it is: A reusable block that combines global self-attention (rule discovery) with depth-wise convolution (local pixel updates).
- How it works:
- Attention spreads rule information across the whole grid.
- Convolution edits local neighborhoods to keep shapes crisp.
- The same block repeats, so rules get applied consistently.
- Why it matters: ARC puzzles require both “understand the rule everywhere” and “change pixels precisely here.”
🍞 Anchor: It’s like hearing the coach’s play (attention) and then moving your feet exactly right on the court (convolution).
— New concept — 🍞 Hook: When building a LEGO set, you don’t keep adding pieces after it matches the picture—you stop right there.
🥬 The Concept (Dynamic Exit Mechanism):
- What it is: A parameter-free rule that halts the loop as soon as predictions become confident (low entropy).
- How it works:
- After each step, compute predictive entropy across the grid.
- If it’s below a threshold (e.g., 0.05), stop; else, continue.
- Cap with a hard max number of steps just in case.
- Why it matters: Saves compute on easy tasks and preserves performance on hard ones by giving them more steps.
🍞 Anchor: Like a quiz where you can submit early if you know you got it right, but you can also use the full time if you need it.
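A sketch of the parameter-free halting check, using the entropy threshold and hard step cap described above (the function name and default values are illustrative):

```python
import torch
import torch.nn.functional as F

def should_halt(logits: torch.Tensor, step: int, tau: float = 0.05, t_max: int = 12) -> bool:
    """Stop once the grid-averaged predictive entropy drops below tau,
    or once the hard cap on loop steps is reached."""
    probs = F.softmax(logits, dim=-1)                        # (H, W, C) color probabilities
    entropy = -(probs * torch.log(probs + 1e-12)).sum(-1)    # per-pixel uncertainty
    return entropy.mean().item() < tau or step + 1 >= t_max
```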
Before vs After:
- Before: Make the model bigger (more layers/parameters) to try to handle harder puzzles; returns diminish.
- After: Keep the model modest, give it more thinking steps when needed, and stop early when done—better accuracy per parameter and smarter compute use.
Why it works (intuition):
- Iteration lets the model do hypothesis-testing: guess the rule, try it, check, and refine.
- Weight tying encourages learning a general “thought step” that can be safely reapplied.
- Hybrid processing mirrors ARC’s nature: global rule + local update.
- Entropy halting matches how confidence rises as a good solution forms.
Building blocks: Embedding to tokens, Looped Hybrid Block with attention+conv, step embeddings to track progress, predictive head, entropy early-exit, and optional test-time training to specialize on the given puzzle demos.
03 Methodology
High-level overview: Input (grids + demos) → Embed to tokens → Recurrent Core (Hybrid Block) looped T times with shared weights → Predict per-pixel colors each step → Check entropy to possibly exit early → Final output grid.
Step 1: Embed the task
- What happens: We take the visual grids (colored pixels) and task context (a few input→output examples) and encode them into a sequence of tokens with positions.
- Why this step exists: Transformers work on token sequences. Without a good embedding that preserves where each pixel is, the model would lose spatial structure.
- Example: A 10×10 grid with 5 colors becomes 100 image tokens plus a few task tokens that summarize the example pairs.
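A minimal sketch of how a color grid could become position-aware tokens. The flat learned position table and the zero-initialized task tokens are placeholder assumptions; the paper's attention uses rotary positional embeddings inside the block instead.

```python
import torch
import torch.nn as nn

n_colors, dim, H, W = 10, 64, 10, 10
color_embed = nn.Embedding(n_colors, dim)      # one learned vector per color value
pos_embed = nn.Embedding(H * W, dim)           # one learned vector per grid position (placeholder)

grid = torch.randint(0, n_colors, (H, W))      # a 10x10 grid of color indices
tokens = color_embed(grid.flatten()) + pos_embed(torch.arange(H * W))   # (100, dim) image tokens
task_tokens = torch.zeros(4, dim)              # placeholder summary of the demo pairs
sequence = torch.cat([task_tokens, tokens])    # (104, dim) sequence fed to the recurrent core
```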
Step 2: The recurrent transition (the reusable “thought step”)
- What happens: We apply the same Hybrid Block multiple times. Each loop updates an internal state z_t. Inside the block:
- Self-attention (with Rotary Positional Embeddings) lets all tokens share information globally.
- A Heterogeneous ConvGLU feed-forward layer applies a 3×3 depth-wise convolution to only the image tokens (not the task tokens), then gates and mixes channels.
- RMSNorm and residual connections keep updates stable across many loops.
- Why this step exists: ARC puzzles need both far-away rule sharing (attention) and close-up, pixel-precise edits (convolution). Repeating the same block encourages consistent, algorithm-like updates.
- Example: Suppose the rule is “mirror shapes across the vertical axis.” Attention helps spread “mirror-left-to-right” as a global instruction; convolution applies it cleanly to neighbors so edges don’t break.
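Below is a simplified sketch of what one weight-tied hybrid block could look like, assuming standard multi-head attention in place of the paper's RoPE attention, LayerNorm in place of RMSNorm, and a plain gated depth-wise convolution as a stand-in for the full Heterogeneous ConvGLU; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridBlockSketch(nn.Module):
    """Global attention + local 3x3 depth-wise conv on image tokens only (simplified)."""
    def __init__(self, dim=64, heads=4, grid=10, n_task=4):
        super().__init__()
        self.grid, self.n_task = grid, n_task
        self.norm1 = nn.LayerNorm(dim)           # stand-in for RMSNorm
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # 3x3 depth-wise conv
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, z):                         # z: (B, n_task + grid*grid, dim)
        h = self.norm1(z)
        z = z + self.attn(h, h, h)[0]             # global rule sharing across all tokens
        h = self.norm2(z)
        task, img = h[:, :self.n_task], h[:, self.n_task:]
        B, N, D = img.shape
        img = self.dwconv(img.transpose(1, 2).reshape(B, D, self.grid, self.grid))
        img = img.flatten(2).transpose(1, 2)      # back to (B, grid*grid, dim)
        mixed = torch.cat([task, img], dim=1)     # task tokens skip the convolution
        return z + self.proj(F.silu(self.gate(h)) * mixed)   # gated channel mixing + residual
```

Because the same instance runs at every loop, taking more reasoning steps never adds parameters.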
Step 3: Step embeddings (progress markers)
- What happens: Each loop adds a small step-specific embedding e_t so the model knows how far along it is.
- Why this step exists: It disambiguates early vs. late stages (explore vs. execute). Without it, the model might repeat the same behavior forever.
- Example: Early steps show broad attention (looking around); later steps focus sharply on the parts to change.
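A minimal sketch of how per-step embeddings could be mixed in (adding e_t to the whole state is one plausible choice for illustration, not necessarily the paper's exact scheme; the generic encoder layer again stands in for the Hybrid Block):

```python
import torch
import torch.nn as nn

T_max, dim = 12, 64
step_embed = nn.Embedding(T_max, dim)          # one small "progress marker" per loop index
block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

z = torch.randn(1, 104, dim)                   # state from the embedding stage
for t in range(T_max):
    e_t = step_embed(torch.tensor(t))          # tells the block how far along it is
    z = block(z + e_t)                         # same weights every loop, but step-aware input
```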
Step 4: Predictive head and probabilities
- What happens: After each loop, a small head maps the state to per-pixel color probabilities.
- Why this step exists: We need explicit predictions to measure confidence and to see if we’re done.
- Example: At loop 3, the model outputs a 10×10×C table of probabilities where C is number of colors. Many pixels might still be uncertain (e.g., two colors are close).
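A sketch of a per-pixel color head under the same illustrative layout (4 task tokens followed by 100 image tokens):

```python
import torch
import torch.nn as nn

dim, n_colors, H, W, n_task = 64, 10, 10, 10, 4
head = nn.Linear(dim, n_colors)                # small per-token color classifier

z = torch.randn(1, n_task + H * W, dim)        # internal state after some loop
logits = head(z[:, n_task:])                   # image tokens only -> (1, 100, n_colors)
probs = logits.softmax(-1).reshape(1, H, W, n_colors)   # the 10x10xC probability table
```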
Step 5: Dynamic Exit via predictive entropy
- What happens: We compute average pixel-wise Shannon entropy H_t. If H_t < τ (e.g., 0.05), we stop; otherwise we continue until a max T_max.
- Why this step exists: Different puzzles need different amounts of thinking. This saves compute on easy ones and gives hard ones more time.
- Example: A simple “copy the blue block” puzzle might stabilize by step 4; a tricky “fill with gravity” puzzle might need 7–8 steps.
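Putting the loop and the exit check together, here is a sketch of the inference routine (block, head, and the token layout are the illustrative stand-ins from the previous steps; the threshold and cap follow the values quoted above):

```python
import torch
import torch.nn.functional as F

def run_with_dynamic_exit(block, head, z, tau=0.05, t_max=12, n_task=4):
    """Loop the shared block, compute the mean pixel entropy H_t after each step,
    and stop as soon as H_t < tau (or after t_max steps at most)."""
    for t in range(t_max):
        z = block(z)                                          # one more "thought step"
        logits = head(z[:, n_task:])                          # per-pixel color logits
        probs = F.softmax(logits, dim=-1)
        h_t = -(probs * torch.log(probs + 1e-12)).sum(-1).mean()
        if h_t.item() < tau:                                  # prediction has "crystallized"
            break
    return logits.argmax(-1), t + 1                           # predicted colors, steps used
```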
Step 6: Training protocol
- What happens: During offline training, we unroll for a fixed T (e.g., 12) and supervise the final step with cross-entropy loss. At evaluation, we do Test-Time Training (TTT): lightly fine-tune the shared weights on the few given demos with small augmentations (rotations, flips, color swaps).
- Why this step exists: Fixed-depth training teaches a stable, convergent thought step. TTT specializes that general step to the specific puzzle’s style for faster, surer convergence.
- Example: If demos are rotated, TTT helps the model align its internal rule so it crystallizes in fewer loops.
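A sketch of the fixed-depth training step, with test-time training noted in a comment. The embed_fn helper and the batch keys are hypothetical names; supervising only the final step with cross-entropy follows the description above.

```python
import torch
import torch.nn.functional as F

def train_step(block, head, embed_fn, batch, optimizer, T=12, n_task=4):
    """Unroll the shared block for a fixed T and supervise only the final prediction."""
    z = embed_fn(batch["input"])                   # tokens for the test input + demo context
    for _ in range(T):
        z = block(z)                               # the same shared weights at every step
    logits = head(z[:, n_task:])                   # (B, H*W, n_colors)
    loss = F.cross_entropy(logits.flatten(0, 1), batch["target"].flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Test-Time Training (TTT): run the same routine for a few steps on the puzzle's own
# demo pairs (plus rotations, flips, and color swaps) before predicting the test grid.
```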
What breaks without each piece:
- Without attention: The model can’t broadcast rules; it might fix pixels locally but miss global structure.
- Without depth-wise conv: Edges and shapes can smear; rules don’t apply crisply at pixel level.
- Without weight tying: Parameters balloon; the model memorizes layer-by-layer tricks rather than a reusable rule.
- Without step embeddings: The loop may not transition from exploring to executing.
- Without Dynamic Exit: You waste compute on easy cases or stop too early on hard ones.
- Without TTT: You lose the last bit of adaptation to the exact demos, hurting convergence.
Secret sauce:
- Weight-tied recurrence turns a big model into a small algorithm that can run longer.
- Hybrid local+global processing matches ARC’s nature (cellular updates plus rule sharing).
- Parameter-free entropy exit makes compute elastic, not fixed.
Mini data walk-through:
- Input: Two demo pairs show that red objects fall down like “gravity.” A new grid has a red block floating.
- Loop 1–2 (explore): Attention inspects demos; probabilities still fuzzy (higher entropy).
- Loop 3–5 (execute): Convolutional updates pull red pixels downward step by step; entropy drops as pixels settle.
- Loop 5: H_t < 0.05. The model stops and outputs the settled grid.
04 Experiments & Results
The test: Evaluate on ARC-AGI-1 (the main benchmark) with Pass@2, using RE-ARC augmentations for training. Also report ARC-AGI-2 for harder, more symbolic generalization. Measure accuracy, parameter count, and compute (GFLOPs), with a focus on efficiency.
The competition:
- Language-first methods (e.g., DeepSeek R1, Claude, GPT-5 variants, Grok-4-thinking) serialize grids to text and use LLM priors.
- Recurrent token models (HRM, TRM) operate on 1D sequences of discrete grid tokens.
- Vision baselines (VARC, including a 73M-parameter ensemble) solve directly from pixels in a single pass.
The scoreboard (with context):
- Loop-ViT Large (18M params): 65.8% on ARC-AGI-1 Pass@2, beating the 73M-parameter feed-forward ensemble (60.4%). That's like a single student getting an A while a much bigger study group averages a B.
- Loop-ViT Medium (11.2M): 63.8% on ARC-AGI-1; 11.5% on ARC-AGI-2.
- Loop-ViT Small (3.8M): 60.1% on ARC-AGI-1, surpassing VARC 18M (54.5%). Think of a pocket calculator outscoring a desktop machine by taking more steps.
- ARC-AGI-2: Loop-ViT Large reaches 14.2%, improving over vision baselines, though the set remains challenging.
Efficiency findings:
- Recurrence dividend: Reusing weights across steps outperforms simply adding more layers. Scaling time (iterations) beats scaling space (parameters) for these tasks.
- Dynamic Exit improves the accuracy–compute frontier versus fixed-step loops by skipping unneeded steps on easy puzzles and spending them on hard ones.
Surprising insights:
- Small-but-looped > large-but-single-pass in many cases. The 3.8M model clears 60%, showing how iteration unlocks depth without more parameters.
- Attention evolves: Early loops are broad (global scanning), later loops become sharp and local (execution), mirroring human problem solving.
- Prediction crystallization: Entropy and step-to-step differences drop together across loops, aligning with the idea of converging to a stable attractor.
Ablations and scaling:
- Joint scaling (depth B vs. loops T): For tiny cores (B=2), increasing T yields huge gains; for bigger cores (B up to 10), accuracy still rises with T, limited by compute budget.
- Hybrid vs. vanilla Transformer: The Hybrid Block (depth-wise conv + attention) consistently wins across depths, confirming the need for local spatial priors in grid reasoning.
- Early-exit vs. fixed steps: Entropy-based halting reaches higher accuracy with lower average GFLOPs than a fixed 6-step baseline. Tasks that exit at step 5 are much easier (e.g., ~83% Pass@1) than those needing step 8 (~46%), showing the halting rule tracks difficulty.
Bottom line: Loop-ViT sets a stronger accuracy–compute–parameter Pareto frontier for visual ARC by turning depth into time and making time adaptive.
05 Discussion & Limitations
Limitations:
- Domain focus: The design is tailored to grid-based visual reasoning like ARC; direct transfer to non-grid or high-resolution continuous tasks may require changes.
- Hyperparameters: Loop count ranges, entropy threshold (e.g., 0.05), and block depth need tuning; poor choices can under- or over-think.
- Training protocol: Fixed-depth training asks the model to converge by step T; very long reasoning chains beyond T may need curriculum or extended training.
- Test-Time Training (TTT): While lightweight, it adds adaptation steps at inference, which some real-time systems may not allow.
- ARC-AGI-2: Gains are meaningful but the benchmark remains hard; more symbolic structure may still be needed.
Required resources:
- A GPU capable of unrolling several recurrent steps (T up to ~20–28 for the Small model in tests).
- Datasets: ARC-AGI-1 and synthetic RE-ARC augmentations.
- Modest memory footprint (3.8M–18M params) but variable compute depending on task difficulty.
When not to use:
- Ultra-low-latency settings where variable-time inference is unacceptable.
- Tasks that are purely perceptual and solved well in one pass (e.g., simple classification on natural images).
- Non-spatial problems where the local convolution prior does not help.
Open questions:
- Can we learn a task-adaptive entropy threshold or a learned halting policy that keeps the parameter-free simplicity but adapts across domains?
- How far can loops extrapolate beyond the training step budget T without instability?
- Can richer task tokens or external memory improve ARC-AGI-2 performance by handling more symbolic composition?
- How interpretable can the latent state and attention dynamics become—can we expose the internal “algorithm” more directly?
- What happens when we mix loops with planning modules or external tools (e.g., differentiable program sketches)?
06 Conclusion & Future Work
Three-sentence summary: Loop-ViT is a looped Vision Transformer that separates reasoning time from model size by reusing a Hybrid Block across multiple steps and stopping when predictions become confident. This adaptive, iterative computation beats much larger feed-forward models on ARC-AGI-1 while using far fewer parameters and smarter compute. The model’s attention shifts from global scanning to local execution, and its predictions “crystallize,” confirming genuine step-by-step reasoning.
Main achievement: Proving that scaling time (iterations with weight tying and entropy-based halting) is a more powerful axis than scaling space (parameters) for abstract visual reasoning on ARC.
Future directions:
- Stronger symbolic handling to boost ARC-AGI-2.
- Learned or task-adaptive halting and curriculum for very long chains of thought.
- Extending hybrid recurrence to other visual domains (segmentation cleanup, design tools, robotics perception).
- Combining loops with external memory or program-like modules.
Why remember this: Loop-ViT reframes depth as time, shows how small models can think deeply, and introduces a clean, parameter-free way to stop at the right moment—offering a new blueprint for efficient, iterative visual reasoning.
Practical Applications
- Adaptive puzzle solvers that iterate more on tough instances and finish early on easy ones.
- Robotics perception that refines object masks step by step, saving compute when scenes are simple.
- Medical image cleanup (e.g., removing artifacts or filling gaps) with confidence-based early stopping.
- Satellite or aerial imagery analysis where iterative refinement sharpens boundaries under uncertainty.
- Document layout repair (fixing tables, boxes, and lines) with global rule discovery plus local edits.
- Game AI for grid worlds (pathfinding, gravity-like rules) that applies rules iteratively and halts when stable.
- CAD/graphic design tools that snap shapes and symmetries through repeated, precise local updates.
- Anomaly detection that tests hypotheses across loops and stops when a confident segmentation emerges.
- On-device vision (phones, drones) that must budget compute dynamically based on task difficulty.
- Education tools that generate or solve visual reasoning worksheets with step-by-step, explainable updates.