GENIUS: Generative Fluid Intelligence Evaluation Suite
Key Summary
- The paper introduces GENIUS, a new test that checks whether image-generating AIs can think on the fly, not just recall facts.
- It focuses on Generative Fluid Intelligence (GFI): spotting hidden patterns, following brand-new rules, and adapting when the context changes.
- GENIUS contains 510 expert-made challenges with mixed text-and-image instructions that cannot be solved using memorized knowledge alone.
- The suite measures three things at once: Rule Compliance (did it follow the new rule?), Visual Consistency (did it keep the right identity?), and Aesthetic Quality (does it look good and logical?).
- Testing 12 popular models showed big gaps: even top systems scored around a "D," meaning they look pretty but often break the rules.
- A key discovery is an "execution gap": models often understand the instructions (in VQA form) but still fail to draw them correctly.
- The authors trace many failures to messy attention over the context and show a training-free attention fix that boosts scores.
- GENIUS aims to shift AI progress from memorizing to flexible, context-grounded reasoning in visual generation.
- A Large Multimodal Model (like Gemini) can serve as a reliable judge, showing high agreement with human experts.
- The benchmark and a simple attention-adjustment baseline give the community a clear path to build more adaptable, trustworthy generators.
Why This Research Matters
Asking AI to follow brand-new rules is common in real life: designers mix styles, teachers invent symbols for exercises, and storytellers flip physics in fantasy worlds. GENIUS makes sure image AIs can actually do that, rather than just drawing nice-looking defaults. This helps creators trust AI for precise, custom tasks, not just generic art. It also reduces costly mistakes where an image looks great but ignores the client's instructions. By revealing the gap between understanding and execution, the paper guides engineers toward models that can both reason and render faithfully. In the long run, this means safer, more controllable AI systems for education, media, science, and beyond.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook) You know how some kids ace the spelling bee because they memorized every word, but then get stumped when the teacher gives them a brand-new riddle? Memorizing is different from figuring things out on the spot.
Filling (The Actual Concept)
- What it is: This paper studies how image-making AIs handle brand-new, never-seen-before instructions and situations, not just how well they remember things.
- How it works (story of the field):
- Unified Multimodal Models (UMMs) can read text, look at images, and generate images all together.
- Most tests so far measured Crystallized Intelligence (CI), which is like "what you already know." If you ask for a cat, a CI-strong model draws a great cat because it saw millions before.
- But real-world creativity needs Fluid Intelligence: figuring out rules on the fly, adapting to new constraints, and reasoning from just the current context.
- Until now, we lacked a clean way to test this flexible thinking in generation.
- Why it matters: If AIs only memorize, they will fail when a user invents a new symbol, changes the laws of physics in a story, or mixes styles in a custom way. That's when flexible thinking matters most.
Bottom Bread (Anchor) Imagine you tell an AI, "In my game, a blue square means 'shrink the object,' and a yellow triangle means 'flip it.' Now apply blue then yellow to this toy car picture." A memorizer might draw a nice car, but a flexible reasoner must apply your new symbols correctly.
Top Bread (Hook) Imagine a Swiss-Army-knife student who can read, draw, and solve puzzles at once. That's like a Unified Multimodal Model.
Filling
- What it is: Unified Multimodal Models (UMMs) are AI systems that mix text and images as inputs and outputs in a single model.
- How it works:
- They take interleaved text and images (like a comic strip with captions).
- They build a shared understanding of whatâs going on.
- They then generate a new image (or edit one) based on that context.
- Why it matters: Many real tasks use mixed signals, like mood boards plus instructions, so UMMs should handle both together.
Bottom Bread (Anchor) A designer pastes two reference photos (palette A is loved, palette B is disliked) and says, "Recolor this third image using palette A." A good UMM reads the words, inspects the pictures, and recolors the third image accordingly.
Top Bread (Hook) Think of two kinds of smarts: one is your "dictionary brain," the other is your "puzzle brain."
Filling
- What it is: Crystallized Intelligence (CI) is stored knowledge (like definitions); Fluid Intelligence (FI) is the ability to solve new problems.
- How it works:
- CI retrieves facts and patterns learned from lots of data.
- FI builds or adapts rules from the here-and-now context.
- Why it matters: Real creativity needs FI, not just CI. Without FI, AI can't follow brand-new rules or reason under surprises.
Bottom Bread (Anchor) If you've never seen "gravity by color" before, CI won't help. FI is what lets you say, "So, on this planet, yellow floats and blue sinks? Okay, I'll draw a yellow pear floating."
Top Bread (Hook) You know how detectives pull a clue from a photo and a clue from a note, then connect them to solve the case?
Filling
- What it is: Generative Fluid Intelligence (GFI) is an AI's ability to generate images by inducing new patterns, following ad-hoc rules, and adapting to fresh context.
- How it works:
- Induce hidden patterns (e.g., "What style does this person like?").
- Execute new rules (e.g., "This icon means 'make it snow'").
- Adapt to changed common sense (e.g., "Gravity depends on color").
- Why it matters: Without GFI, models fall back to what they memorized, ignoring the instructions that make the task unique.
Bottom Bread (Anchor) A teacher says, "On this worksheet, a circle means double the size." The AI must resize correctly, even if that symbol means nothing outside this worksheet.
Top Bread (Hook) Before GENIUS, it was like grading a math student only on times tables but never on word problems.
Filling
- What it is: The problem is that most benchmarks tested memory (CI), not on-the-fly reasoning (GFI) for image generation.
- How it works:
- Previous tests often used single images or general world facts.
- They didn't force the model to infer new rules from the current context.
- They didn't separate "it looks pretty" from "it followed the rule."
- Why it matters: We couldn't tell if a model truly understood the new instructions or was just good at drawing nice-looking defaults.
Bottom Bread (Anchor) A model might render a gorgeous snowy city, but if the rule says the special symbol should make it rain, pretty snow is a fail. That subtlety needed a new kind of test.
02 Core Idea
Top Bread (Hook) Imagine a science fair where every project has custom rules posted right next to it, and you must follow those rules exactly to win.
Filling (The Aha!)
- One-sentence insight: The key idea is to evaluate image-generating AIs on Generative Fluid Intelligence by giving them novel, fully context-defined rules and checking whether they can infer, execute, and adapt on the fly.
- Multiple analogies:
- Board game night: Each table invents new rules; good players read the card and instantly play correctly.
- Cooking with constraints: "Use this spice from photo A, the plating style from photo B, and keep the same pasta shape from photo C."
- Science lab: "On Planet X, red objects fall faster and blue float; draw what happens to this blue balloon."
- Before vs After:
- Before: Models scored high by recalling common objects and styles; they weren't pushed to follow brand-new rules.
- After: GENIUS forces models to learn from the immediate context and checks both logic and looks.
- Why it works (intuition):
- If the only way to solve a task is to read the local rules (not world knowledge), then success signals true flexible reasoning.
- If a model keeps drawing "default" answers, we catch it with rule-specific checks.
- Building blocks:
- Three GFI primitives (induce patterns, execute ad-hoc constraints, adapt to new context).
- Five concrete tasks spanning symbolic and visual constraints, metaphor sense, and prior-conflicting worlds.
- A hybrid evaluation (Rule Compliance, Visual Consistency, Aesthetic Quality) with expert hints to anchor judging.
- Diagnostics that separate "understanding" from "execution."
- A training-free attention intervention that improves focus on critical context tokens.
Bottom Bread (Anchor) Example: The context defines that a blue square means "remove the nearest tree," and a green circle means "add a red kite." The model must apply blue then green to a base photo. GENIUS checks if the tree was removed and the kite added in the right order and style.
Top Bread (Hook) You know how teachers sometimes give you clues hidden across a paragraph and a picture? You must focus on the important bits and ignore the fluff.
Filling
- What it is: Interleaved multimodal context means the instructions and examples are woven together as text-and-image sequences.
- How it works:
- The task description and examples are spread across both words and pictures.
- Removing any part (just text or just images) makes the puzzle unsolvable.
- The model must align the pieces to infer the rule.
- Why it matters: This guarantees the answer can't be guessed from memory; the rule truly lives in the given context.
Bottom Bread (Anchor) "Use the color palette you like (shown in Image 1), avoid the palette you dislike (Image 2), and recolor Image 3." Without seeing both palettes and the target image, the model can't possibly know what to do.
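To make the "woven together" idea concrete, here is a minimal sketch of how such an interleaved sample might be represented, assuming a simple list-of-segments schema. The field names and the `is_solvable` check are illustrative only, not the benchmark's actual data format.

```python
# Hypothetical representation of one interleaved multimodal task.
sample = {
    "task": "implicit_pattern_generation",
    "segments": [
        {"type": "text",  "content": "I like the palette in this image:"},
        {"type": "image", "path": "palette_liked.png"},
        {"type": "text",  "content": "I dislike this palette:"},
        {"type": "image", "path": "palette_disliked.png"},
        {"type": "text",  "content": "Recolor the following image accordingly:"},
        {"type": "image", "path": "target.png"},
    ],
}

def is_solvable(sample):
    """The task becomes undefined if either modality is stripped out."""
    types = {seg["type"] for seg in sample["segments"]}
    return {"text", "image"} <= types

print(is_solvable(sample))  # True: both modalities present
```

Dropping all image segments (or all text segments) makes `is_solvable` return `False`, which mirrors the paper's design requirement that removing either modality leaves the puzzle unsolvable.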
Top Bread (Hook) Imagine a referee who uses a perfect checklist for every play, plus a style judge who makes sure the game still looks like real sports.
Filling
- What it is: Hybrid evaluation combines three metrics, Rule Compliance (RC), Visual Consistency (VC), and Aesthetic Quality (AQ), with human-curated hints.
- How it works:
- RC: Did the output follow the exact new rule? (strict nouns, attributes, counts, layouts)
- VC: Did it keep the right identity or key features from references?
- AQ: Does it look physically logical and professionally rendered?
- A Large Multimodal Model (like Gemini-3-Pro) scores with structured prompts; human-verified hints keep it grounded.
- Why it matters: A picture can be pretty but wrong. These three lenses prevent "look-good-but-breaks-the-rules" from passing.
Bottom Bread (Anchor) If a rule says "three apples," and the image shows two perfect apples with studio lighting, AQ might be high but RC is low, so the model doesn't get away with it.
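The three-lens rubric can be sketched as a small scoring routine. The 0/1/2 scale per metric and the averaging over three judging runs follow the paper's description; the specific weights below are hypothetical placeholders standing in for the benchmark's (unstated here) emphasis on Rule Compliance.

```python
from statistics import mean

# Hypothetical weights: Rule Compliance is weighted most heavily,
# but these exact numbers are illustrative, not the paper's.
WEIGHTS = {"RC": 0.5, "VC": 0.3, "AQ": 0.2}

def overall_score(runs):
    """Average the judge's runs (each a dict of 0/1/2 scores),
    then rescale the weighted sum to 0-100."""
    avg = {k: mean(run[k] for run in runs) for k in WEIGHTS}
    weighted = sum(WEIGHTS[k] * avg[k] for k in WEIGHTS)  # in [0, 2]
    return 100 * weighted / 2

# A "pretty but wrong" output: perfect AQ and VC, shaky RC.
runs = [{"RC": 1, "VC": 2, "AQ": 2},
        {"RC": 0, "VC": 2, "AQ": 2},
        {"RC": 1, "VC": 2, "AQ": 2}]
print(round(overall_score(runs), 1))  # well below a perfect 100
```

This is how the rubric catches the "look-good-but-breaks-the-rules" failure mode: beautiful apples with the wrong count still drag the overall score down through the RC term.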
03 Methodology
Top Bread (Hook) Think of GENIUS like a puzzle book where each riddle includes tiny pictures and captions that secretly define the rules you must follow.
Filling (High-level recipe)
- What it is: A benchmark suite for Generative Fluid Intelligence with interleaved text-and-image tasks and a judge that scores three aspects.
- How it works (pipeline):
- Input: Interleaved multimodal context (text + images) that defines new rules and a final instruction.
- Generation: A model produces a new image or edited image.
- Evaluation: A judge LMM scores Rule Compliance, Visual Consistency, and Aesthetic Quality using expert hints; scores are averaged.
- Why it matters: This structure isolates flexible reasoning, not memorized facts, and ties scoring to exact instructions.
Bottom Bread (Anchor) Example: Context defines "this symbol makes it snow; that symbol makes it rain." Instruction: "The city in Image 3 encounters the symbol from Image 4; show what happens." The judge checks for the right weather and preserved city identity.
Each Step in Detail
- Interleaved Context Construction
- What happens: Experts design 510 samples across 3 dimensions and 5 tasks, ensuring the brand-new rules only live in the provided context.
- Why this step exists: Without carefully crafted contexts, models could cheat by using memorized associations.
- Example: "I like the palette in Image 1, dislike the palette in Image 2; recolor Image 3 accordingly." Remove any image, and the task becomes undefined.
- Task Dimensions and Subtasks
- What happens: Three dimensions operationalize GFI:
- Implicit Pattern Induction → Implicit Pattern Generation (learn preferences like style, palette, layout)
- Ad-hoc Constraint Execution → Symbolic Constraint Generation (icons = rules), Visual Constraint Generation (patches or examples = operations)
- Contextual Knowledge Adaptation → Prior-Conflicting Generation (counter-intuitive worlds), Multi-Semantic Generation (literal vs figurative)
- Why this step exists: Each dimension tests a unique GFI primitive; together they form a complete picture.
- Example: "On this planet, gravity depends on color; draw the yellow pear."
- Hybrid Evaluation with LMM-as-a-Judge
- What happens: The judge LMM reads expert hints (gold standards) and the model's output, then assigns 0/1/2 for RC, VC, and AQ.
- Why this step exists: Purely automatic scoring can be vague; expert hints anchor exact rule checks.
- Example: RC hint: "three apples," VC hint: "same character from Image 2," AQ checks realism and artifacts.
- Reliability Checks
- What happens: Compare judge scores with human ratings; test a second judge model to verify ranking stability.
- Why this step exists: To ensure the metric is trustworthy, not just an artifact of a single judge.
- Example: High Pearson correlation with humans; similar rankings with a stricter open-source judge.
- Diagnostic Probes: Comprehension vs Execution
- What happens: Convert some generation tasks into multiple-choice VQA to test whether the model understood the target outcome.
- Why this step exists: If a model knows the right answer in words but canât draw it, the failure is in execution, not comprehension.
- Example: "Which description matches the correct edited image?" Models often answer correctly but mis-generate.
- Secret Sauce: Training-free Attention Intervention
- What happens: Adjust attention during inference to focus on task-critical tokens without retraining.
- Why this step exists: Attention over context was noisy (random spikes), diluting the signal needed for following new rules.
- Example Data Walkthrough:
- Keyword Distillation: Prompt the model to list which parts matter (e.g., "art style from Image 2," "shirt from Image 1," "base is Image 3").
- Relevance Mapping: Compute how strongly each visual token matches the distilled keywords to form a relevance map.
- Bias Injection: Add a bias to attention logits so high-relevance tokens get emphasized; softmax then suppresses noise tokens.
- What breaks without it: The model keeps paying attention to irrelevant background, so its "implicit gradient" for in-context learning wanders, and the output reverts to defaults.
Bottom Bread (Anchor) Imagine highlighting the exact steps in a recipe (boil pasta, then add sauce) and dimming the rest of the page. The cook now follows the key steps in the right order; result: the intended dish, not a random one.
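The bias-injection step above can be sketched in a few lines of NumPy. The `relevance` vector stands in for the keyword-matching step, and the additive `scale` is a hypothetical hyperparameter; the paper's exact formulation may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(logits, relevance, scale=2.0):
    """Bias Injection: add scale * relevance to the attention logits
    so tokens matching the distilled keywords gain attention mass,
    while softmax renormalization suppresses noise tokens."""
    return softmax(logits + scale * relevance)

rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 6))                       # one query over six context tokens
relevance = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])   # tokens 2-3 are task-critical

plain = softmax(logits)
biased = biased_attention(logits, relevance)
# Attention mass provably shifts toward the task-critical tokens.
print(plain[0, 2:4].sum() < biased[0, 2:4].sum())  # True
```

Because the adjustment is purely a logit offset at inference time, no weights change: this is what makes the fix "training-free."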
04 Experiments & Results
Top Bread (Hook) Picture a championship where teams must follow custom rules for each match. If they play beautifully but break the rules, they still lose.
Filling (The Test)
- What it is: GENIUS evaluates 12 notable models on 510 interleaved-context tasks, scoring Rule Compliance, Visual Consistency, and Aesthetic Quality.
- How it works:
- Each sample is judged three times (stability).
- Weighted overall scores emphasize rule-following most.
- Models that accept interleaved input use it; others use a decoupled format.
- Why it matters: We need to know who can actually follow new rules, not just draw pretty images.
Bottom Bread (Anchor) If the rule is "make the dog look noisy using the jagged yellow lines shown earlier," we check specifically for those lines around the same dog, not just any loud-looking scene.
Scoreboard with Context
- Top proprietary model (Nano Banana Pro) scored about 57 out of 100, like getting a D when the pass mark is 60.
- Strong open-source model Bagel scored about 27, far from passing.
- The authors' training-free attention tweak on Bagel raised scores consistently, showing the fix meaningfully helps execution.
- Broad trend: Aesthetic Quality often stays high (images look nice), but Rule Compliance is much lower (logic and rules get broken). That's the "illusion of competence."
Surprising Findings
- Pre-planning and post-reflection barely help: Adding extra reasoning text before or after generation didn't boost scores much.
- Context comprehension helps, but not enough alone: With hints, some models improved a lot, but weaker generators still struggled to execute.
- VQA vs Generation split: Models often knew the right answer in multiple-choice form but failed to draw it, evidence of an execution gap.
- Contextual Knowledge Adaptation was the hardest: When new rules conflicted with world knowledge, models clung to their priors (cognitive inertia).
- LMM-as-a-Judge is reliable: High correlation with human experts; rankings stayed consistent when switching to a stricter open-source judge.
- Input format matters: Interleaving images into text (vs separating them) generally helped; fine-grained vs standard interleaving made a smaller difference.
- Removing context wrecks rule-following: Without the interleaved examples, scores plunged, proving the tasks really do require the given context.
Bottom Bread (Anchor) Even when models could tick the right multiple-choice box for "what should the final image look like," many still drew the wrong number of apples or used the wrong palette, showing they knew the rule but couldn't execute it reliably.
05 Discussion & Limitations
Top Bread (Hook) Imagine a student who can explain the riddle's answer out loud but can't actually build the puzzle in their hands.
Filling (Honest Assessment)
- Limitations:
- Scope focus: GENIUS emphasizes GFI for image generation; it doesn't measure every skill (e.g., safety, long-horizon editing pipelines).
- Coverage: 510 expert-curated samples are rich but cannot span all possible creative, cultural, or domain-specific contexts.
- Human bias: Expert-made hints and tasks reduce ambiguity but may reflect designer preferences.
- Judge dependence: Though validated across two judges and humans, automated judging is still an approximation of human review.
- Required Resources:
- Multimodal models that can ingest interleaved inputs (ideal).
- Access to a capable judge LMM (or humans) for evaluation.
- GPU resources for inference across 510 cases with three runs each.
- When NOT to Use:
- If your goal is purely photorealism or artistic flair without rule-following.
- If your model cannot process multiple images or interleaved inputs.
- If you only need CI testing (e.g., standard text-to-image prompts).
- Open Questions:
- Bridging the execution gap: How can we transfer precise, rule-grounded understanding from the encoder into the image decoder more faithfully?
- Better attention: Beyond bias injection, can architectures natively align to task-critical tokens and suppress noise?
- Learning to inhibit priors: Can models learn to "let go" of pretraining defaults when context says otherwise?
- Stepwise visual reasoning: Can we interleave image drafts with reasoning traces in a stable, scalable way?
- Generalization: How much data is needed to robustly improve GFI without just learning a new set of memorized tricks?
Bottom Bread (Anchor) Think of teaching a robot chef: it can read the recipe notes perfectly, but still forgets to add salt at the right step. The next breakthroughs must help it reliably act on what it already understands.
06 Conclusion & Future Work
Top Bread (Hook) Imagine moving from "copying what you've seen" to "adapting to whatever happens next." That's the heart of this work.
Filling (Takeaway)
- 3-sentence summary: GENIUS is a new benchmark that measures Generative Fluid Intelligence in image-making AIs by testing whether they can infer new patterns, follow ad-hoc rules, and adapt to fresh contexts. Across 12 models, even top systems often made pretty but rule-breaking pictures, revealing a major execution gap. The authors traced failures to messy attention over context and introduced a training-free attention adjustment that consistently improved scores.
- Main achievement: A precise, theory-grounded standard for testing flexible reasoning in visual generation, plus a simple, effective baseline fix.
- Future directions:
- Architectures that natively prioritize task-critical tokens and inhibit unhelpful priors.
- Visual chain-of-thought methods that keep reasoning aligned from text to pixels.
- Broader, culturally diverse tasks and interactive, step-by-step editing challenges.
- Why remember this: GENIUS changes the goalposts, from "make it look nice" to "make it logically correct under brand-new rules," pushing the field toward truly general-purpose, trustworthy creators.
Bottom Bread (Anchor) The next time you see an AI draw a stunning picture, ask: Did it follow the rules I just invented? GENIUS is how we check.
Practical Applications
- Assess whether your image generator can follow custom, one-off brand or style rules in design workflows.
- Use GENIUS tasks to debug failures where outputs look good but violate instructions (RC vs AQ mismatch).
- Adopt the training-free attention bias method to improve rule-following without expensive retraining.
- Benchmark new multimodal architectures for true adaptability before deployment in production tools.
- Create curriculum-style training that emphasizes inferring rules from interleaved examples.
- Validate model updates with LMM-as-a-judge to maintain consistent, reproducible evaluation.
- Design safer creative assistants that obey precise constraints in advertising or product mockups.
- Build educational exercises where students and teachers define symbols, and the AI must adapt on the spot.
- Prototype scientific visualization tools that follow lab-specific rules (e.g., custom color maps, layouts).
- Develop interactive editing systems that accept small rule cards (icons) and reliably apply them to images.