
Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets

Intermediate
Jialong Zuo, Haoyou Deng, Hanyu Zhou et al. · 12/17/2025
arXiv · PDF

Key Summary

  • This paper checks if a popular text-to-image model called Nano Banana Pro can fix messy photos without any extra training.
  • Across 14 tasks and 40 datasets, it often makes pictures look nicer to people but scores lower on traditional accuracy numbers like PSNR and SSIM.
  • The model sometimes invents tiny details that look real (hallucinations), which helps beauty but hurts pixel-perfect matching to the original.
  • In dehazing and deraining, it can sharpen scenes but may change colors or weather, like making overcast skies too blue.
  • In super-resolution, it sharpens edges but can accidentally add extra borders or wrong characters in text.
  • In motion or defocus deblurring, it can create crisp-looking faces or textures that don’t exactly match the true scene.
  • Perceptual metrics (like NIMA, NIQE) often praise its look, while reference metrics (PSNR, SSIM) punish its differences from ground truth.
  • Using simple prompts with no fine-tuning, it’s a strong zero-shot baseline for making images look good, not for exact scientific restoration.
  • The study argues we need better evaluation methods that balance how good images look with how true they are to the original.
  • Future solutions may mix this kind of generative model with physics rules and task-specific constraints to keep beauty and truth together.

Why This Research Matters

Phones, drones, and cars frequently capture messy images in bad weather or low light; this study shows a single general model can quickly make them look nicer without task-specific training. For casual photography and entertainment, that’s a big win: users can get pleasing results with a simple instruction. But for safety-critical tasks like reading traffic signs, identifying people, or medical imaging, the same model can change important details, so caution is essential. The results push researchers to design better metrics that value both human appeal and truthfulness. They also encourage hybrid methods that combine generative creativity with physics constraints for trustworthy outcomes. Overall, this work helps everyone pick the right tool: the generalist for quick, pretty fixes and the specialist for exact, reliable restoration.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you have a smudged class photo. You ask a very talented artist to clean it. The picture might come back beautiful—sharper smiles, brighter colors—but maybe your friend’s freckles moved a bit. It looks great, but it’s not exactly the same.

🥬 The Concept (Low-Level Vision Tasks):

  • What it is: These are the basic chores of image cleanup—removing haze, noise, rain, blur, and making dark images brighter or clearer.
  • How it works: Step by step, a system looks at a messy picture, estimates what went wrong (like blur or haze), and tries to undo it so the result matches the original scene.
  • Why it matters: Without this, cameras struggle in bad weather or low light, and apps (like traffic monitoring or medical tools) can make mistakes.

🍞 Anchor: Think of polishing your glasses when they’re foggy. You’re not trying to make new shapes—you just want to see what’s really there.

🍞 Hook: You know how a creative friend can draw a cat from memory that looks amazing, even if it doesn’t match a real photo of your cat?

🥬 The Concept (Generative Models):

  • What it is: A generative model is an AI that learns to create realistic images based on what it has seen before.
  • How it works: It studies tons of pictures, learns patterns (like fur textures or sky colors), and uses that knowledge to produce new, plausible images.
  • Why it matters: This lets AI fill in missing details, but it can also guess wrong when exact truth is required.

🍞 Anchor: If your pet’s photo is blurry, a generative model may invent whiskers that look right in general—but not your cat’s exact whiskers.

🍞 Hook: Have you ever rolled dice? Even if you try to get a six, you can’t control the exact outcome every time.

🥬 The Concept (Stochasticity of Generative Models):

  • What it is: Generative models have randomness; the same prompt can produce slightly different results.
  • How it works: They sample from a learned distribution, so small changes (or even none) still yield varied outputs.
  • Why it matters: When you need pixel-perfect matches (like before-and-after comparisons), this randomness hurts scores.

🍞 Anchor: Asking the artist to redraw the same scene twice might give you two pretty, but not identical, pictures.
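
A minimal, self-contained sketch of why this randomness matters for reference metrics. Nothing here calls a real model: random arrays stand in for the ground truth and for two "equally plausible" generations, but the arithmetic shows how even small, independent pixel differences pull PSNR down.

```python
import numpy as np

def psnr(a, b, data_range=255.0):
    """Peak signal-to-noise ratio between two same-sized images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

rng = np.random.default_rng(0)
ground_truth = rng.integers(0, 256, size=(256, 256, 3)).astype(np.float64)

# Stand-ins for two generations from the same prompt: each stays close to the
# truth on average but places fine detail slightly differently.
sample_a = np.clip(ground_truth + rng.normal(0, 8, ground_truth.shape), 0, 255)
sample_b = np.clip(ground_truth + rng.normal(0, 8, ground_truth.shape), 0, 255)

print(f"PSNR(sample_a, ground_truth) = {psnr(sample_a, ground_truth):.2f} dB")
print(f"PSNR(sample_a, sample_b)     = {psnr(sample_a, sample_b):.2f} dB")
# Both samples may look equally "fine" to a viewer, yet neither matches the
# truth exactly and they do not match each other either; reference metrics
# penalize every one of those pixel shifts.
```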

🍞 Hook: Imagine taking a pop quiz on a subject you didn’t study this week—but you’ve paid attention all year.

🥬 The Concept (Zero-shot Evaluation):

  • What it is: Testing a model on a task it wasn’t specially trained for, using only its general knowledge.
  • How it works: Give the model an image and a simple text instruction (prompt), no fine-tuning, and see what it does.
  • Why it matters: It shows how well a big, general model can handle many tasks out of the box.

🍞 Anchor: Telling the model, “Please remove the rain, keep colors the same,” and judging the one-shot result.

🍞 Hook: You know how a drawing can be beautiful to your eyes, but a ruler might say the lines aren’t straight?

🥬 The Concept (Perceptual Quality vs. Pixel Fidelity):

  • What it is: Perceptual quality means it looks good to people; pixel fidelity means it closely matches the original image pixel-by-pixel.
  • How it works: Perceptual metrics reward natural look and aesthetics; fidelity metrics punish any differences from the ground truth.
  • Why it matters: A model can be loved by humans but score poorly on strict math-based tests.

🍞 Anchor: A retouched photo may look stunning but won’t match the exact pixels of the original scene.

🍞 Hook: Think of two scorecards for pictures—one is a beauty judge, the other is a microscope.

🥬 The Concept (Common Metrics: PSNR/SSIM vs. NIMA/NIQE):

  • What it is: PSNR and SSIM check how close the output is to the ground-truth pixels; NIMA and NIQE score human-like appeal or naturalness.
  • How it works: PSNR/SSIM compare images directly; NIQE looks at statistical naturalness; NIMA predicts human aesthetic ratings.
  • Why it matters: A high beauty score doesn’t guarantee high pixel-matching, and vice versa.

🍞 Anchor: The paper finds Nano Banana Pro often gets better beauty/naturalness scores but lower pixel-accuracy scores.
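
A small sketch of the two scorecards, using scikit-image on synthetic arrays (the images and noise level are made up for illustration). PSNR and SSIM need the ground truth; NIQE and NIMA do not, which is exactly why the two lanes can disagree.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(42)
gt = rng.random((128, 128, 3))                                     # stand-in ground truth in [0, 1]
restored = np.clip(gt + rng.normal(0, 0.05, gt.shape), 0.0, 1.0)   # stand-in model output

# "Microscope" (reference) metrics: they need the ground truth and punish any deviation.
print("PSNR:", peak_signal_noise_ratio(gt, restored, data_range=1.0))
print("SSIM:", structural_similarity(gt, restored, channel_axis=-1, data_range=1.0))

# "Beauty judge" (no-reference) metrics such as NIQE and NIMA score the restored
# image alone, with no ground truth, which is why a hallucinated-but-natural result
# can still do well on them. They are not part of scikit-image; dedicated IQA
# packages (e.g., pyiqa) provide implementations.
```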

The world before: For years, task-specific neural nets ruled low-level vision. They were trained on carefully paired data (noisy vs. clean, blurry vs. sharp), and scored high on PSNR/SSIM by sticking closely to the ground truth. But collecting perfect pairs for real-world messiness (like true haze or complex motion blur) is hard. Labs built synthetic datasets with simplified assumptions, and models sometimes failed in the wild.

The problem: Could a powerful, general, text-to-image model—great at creating beautiful images—also serve as a one-model-for-everything fixer without any special training? And if it looks great but doesn’t match pixels, how should we judge it?

Failed attempts: Physics-inspired tricks (like handcrafted haze rules) often break on tricky scenes. CNNs and Transformers trained on synthetic pairs do well on lab tests but can over-smooth textures or fail on real-life conditions. Diffusion models improved looks but got slower or still fought the beauty-vs-accuracy trade-off.

The gap: No one had carefully, broadly checked whether a commercial, generalist generator could handle 14 classic cleanup tasks in zero-shot with simple prompts. And the field lacked a fair way to compare “looks good” vs “matches truth.”

Real stakes: This matters for your phone’s night photos, dashcams in rain, drones in fog, reading signs in motion, and restoring old family photos. If a single general model can do “pretty good” everywhere without training, that’s a huge time-saver—but only if we also know when not to trust it. This paper shows both the promise (beautiful, plausible images) and the risk (hallucinations and color shifts) so people can choose the right tool for the job.

02 Core Idea

🍞 Hook: Imagine two coaches teaching you basketball. One coach demands you hit the exact same spot on the backboard every shot (accuracy). The other coach wants your shots to look smooth and stylish (style). You can be stylish and still miss the exact spot.

🥬 The Concept (Aha! Insight):

  • What it is: A single, general generative model (Nano Banana Pro) can clean up many kinds of messy images in zero-shot using just text prompts, often making them look great to people—but it won’t necessarily match the exact pixels of the original, so traditional scores go down.
  • How it works: The model uses huge visual priors learned from the internet to fill in plausible details and adjust scenes based on the prompt, without task-specific training.
  • Why it matters: It challenges how we judge success. If a picture looks convincingly real, should a small pixel mismatch ruin its score?

🍞 Anchor: Asking, “Remove the rain but keep colors,” may give you a crisp, nice-looking photo that’s not an exact twin of the ground truth.

Three analogies for the same idea:

  1. Artist vs. Tracer: The artist redraws a smudged picture beautifully (perceptual win) but not identically (fidelity loss). The tracer copies every line exactly (fidelity win) but may look boring (perceptual loss).
  2. Chef vs. Chemist: The chef plates a dish that tastes amazing to people (perceptual win), while the chemist wants the ingredients measured to the microgram (fidelity win). The chef can please diners even if the recipe doesn’t match the lab notes.
  3. Map vs. Photo: A map is neat and useful but not a pixel copy of the land. A satellite photo matches pixels but can be messy. The generalist model makes “maps” that people find clean and helpful but not exact photos.

Before vs After:

  • Before: We assumed top scores mean best images. Specialized models trained on pairs dominated fidelity metrics; generative models were rarely evaluated broadly across many low-level tasks without tuning.
  • After: We now see a general model can be a decent all-rounder in zero-shot, often winning in human-like appeal but losing in pixel-true metrics. This suggests two lanes of success: perception and fidelity.

Why it works (intuition, not equations):

  • The model has strong priors—like memory of how skies, bricks, leaves, and faces usually look.
  • When input is damaged (noise, blur, haze), exact truth is partly missing. The model fills gaps with its best guess from experience.
  • That guess is often pleasing and coherent to humans (perception) but shifts pixels from the ground truth (hurts PSNR/SSIM).

🍞 Hook: You know how a jigsaw puzzle with missing pieces forces you to guess the picture from the box image?

🥬 The Concept (Building Blocks):

  • What it is: The evaluation stacks several simple pieces to test the idea.
  • How it works, step by step:
    1. Use simple prompts (like instructions) for each task (e.g., “remove rain”).
    2. Feed the degraded image and prompt into the model once (zero-shot, no extra training).
    3. Resize outputs if needed and compute both fidelity metrics (PSNR/SSIM) and perceptual metrics (NIQE, NIMA, LPIPS).
    4. Compare against specialist models across 14 tasks and 40 datasets.
  • Why it matters: This clean setup shows the honest gap between looking great and matching perfectly.

🍞 Anchor: Like giving the same paint-by-number picture to an artist and a careful color-filler, then grading both with two rubrics: beauty and accuracy to the numbers.

03 Methodology

At a high level: Input (a degraded image) → Add a short text instruction → Nano Banana Pro generates a cleaned image → Compare the result with ground truth using multiple metrics and human-like scores.

🍞 Hook: Think of giving a helper robot a smudged photo and a sticky note that says, “Please remove the smudge, don’t change colors.”

🥬 The Concept (Zero-shot Prompted Restoration Pipeline):

  • What it is: A simple recipe that uses fixed, plain-language prompts to ask a general model to fix images without any extra training.
  • How it works (recipe-style):
    1. Choose a task and dataset (e.g., dehazing on RTTS). No fine-tuning.
    2. Write a short, fixed prompt for the task (e.g., “Remove haze, keep colors and lighting natural.”).
    3. Send the degraded image plus prompt to the model once (no multi-round cherry-picking in the main protocol).
    4. If the model’s output resolution differs, resize it to match the dataset’s ground-truth size for fair metrics.
    5. Compute fidelity metrics (PSNR/SSIM) and perceptual metrics (NIMA, NIQE, LPIPS, MS-SSIM when available).
    6. Compare against strong specialist baselines on each dataset.
  • Why it matters: Without this disciplined setup, we could accidentally over-tune prompts or pick only the best samples, which would not be a fair zero-shot test.

🍞 Anchor: It’s like giving every competitor the same instructions and timing them in the same race, then scoring both speed (fidelity) and style (perception).
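
The protocol can be summarized as a short evaluation loop. The sketch below is an assumption-laden outline, not the paper's code: restore_with_generalist is a placeholder for the single zero-shot model/API call, the prompts are paraphrased rather than quoted, and only the reference metrics are wired up.

```python
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# One fixed, plain-language prompt per task (paraphrased, not the paper's exact wording).
TASK_PROMPTS = {
    "deraining": "Remove rain streaks and raindrops; keep colors, lighting, and atmosphere unchanged.",
    "dehazing": "Remove haze; keep colors and lighting natural.",
}

def restore_with_generalist(image: Image.Image, prompt: str) -> Image.Image:
    """Placeholder for the single zero-shot call to the commercial model/API.
    No fine-tuning, no per-image prompt tweaking, one pass per input."""
    raise NotImplementedError("plug in the actual model or API call here")

def evaluate_pair(output: Image.Image, gt: Image.Image) -> dict:
    # Step 4 of the recipe: resize the output to the ground-truth size so that
    # pixel-aligned reference metrics are even computable.
    output = output.resize(gt.size, resample=Image.BILINEAR)
    out = np.asarray(output, dtype=np.float64) / 255.0
    ref = np.asarray(gt, dtype=np.float64) / 255.0
    return {
        "psnr": peak_signal_noise_ratio(ref, out, data_range=1.0),
        "ssim": structural_similarity(ref, out, channel_axis=-1, data_range=1.0),
        # No-reference perceptual scores (NIQE, NIMA) would be added here via an
        # IQA library; they look only at `out`, never at `ref`.
    }

# Usage, per dataset (paths are illustrative):
# for degraded_path, gt_path in dataset_pairs:
#     out = restore_with_generalist(Image.open(degraded_path), TASK_PROMPTS["deraining"])
#     scores = evaluate_pair(out, Image.open(gt_path))
```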

Each step detailed with examples:

  • Input selection: 14 tasks, 40 datasets spanning restoration (dehazing, deraining, deblurring, denoising, reflection removal, super-resolution), enhancement (low-light, underwater, HDR), and fusion (multi-focus, infrared-visible).
    • Why this step: Breadth prevents cherry-picking easy cases and shows real generalization.
    • Example: For super-resolution, use DIV2K-Val (synthetic), RealSR and DRealSR (authentic).

  • Prompting: Plain, fixed prompts like “This is a rainy image. Please remove rain streaks and raindrops while keeping all other elements, the original color tone, lighting, and atmosphere unchanged.”
    • Why this step: Keeps the test honest—no special magic sentences per image.
    • Example: In deraining, the same sentence was applied across Rain200L/H and SPA-Data.

  • One-pass generation: Use the model as-is, no task-specific fine-tuning, and (overall) no cherry-picked outputs. Some sub-studies note occasional retries when the output was clearly invalid, but the core protocol stays conservative and simple.
    • Why this step: Shows true zero-shot ability.
    • Example: In dehazing and super-resolution, they didn’t hunt for the best sample; they used a fixed prompt.

  • Resolution handling: Outputs often around 1K resolution; they are resized (e.g., bilinear) to match ground-truth for metric comparisons.
    • Why this step: PSNR/SSIM require pixel alignment.
    • Example: Denoising outputs are resized to the dataset’s GT size before computing PSNR/SSIM.

  • Metrics: Dual-lane evaluation.
    • Fidelity: PSNR, SSIM (sometimes LPIPS as a perceptual distance but still GT-referenced).
    • Perceptual: NIQE (no-reference naturalness), NIMA (aesthetic score), MS-SSIM, LPIPS.
    • Why this step: Captures the beauty vs truth trade-off.
    • Example: In super-resolution, Nano Banana Pro had low PSNR/SSIM vs baselines but consistently strong NIQE.

  • Baselines and comparisons: Include top classic and modern models—GANs, Transformers, diffusion, and physics-guided.
    • Why this step: Provides context (is a 21 dB PSNR good or not?).
    • Example: In deraining, compared with Restormer, NeRD-Rain, and others; in motion deblur, compared with Restormer, Uformer, HI-Diff, ID-CDM.

The secret sauce: The paper’s cleverness isn’t a new algorithm—it’s the breadth and fairness of the evaluation. By holding prompts simple and avoiding tuning, the study reveals the honest, natural behavior of a commercial, general generative model across many classic tasks. This exposes a consistent pattern: great-looking results with plausible details, but weak pixel-level alignment.

Concrete data examples across steps:

  • Dehazing: On RTTS and Fattal’s, Nano Banana Pro often earned top NIMA (aesthetic) but worse BRISQUE/FADE in some sets, with visual over-enhancement like overly blue skies.
  • Super-resolution: On DIV2K-Val and real datasets, PSNR/SSIM trailed by large margins (sometimes >4 dB), but NIQE was best-in-class, suggesting clean, natural-looking textures—though sometimes hallucinated, with odd boundary expansion or mistaken text.
  • Deraining: On Rain200H, PSNR around 21.10 dB vs >32 dB for SOTA; still, global structures (like bridge cables) were plausibly reconstructed.
  • Motion/Defocus deblurring: Faces and text could look sharper but be semantically wrong (identity swaps, wrong characters), and true blur kernels weren’t inverted.
  • Denoising: Substantially lower PSNR/SSIM vs task specialists; sometimes crisp text but color shifts or texture loss.
  • Reflection removal: Lower across PSNR/SSIM and perceptual LPIPS/MS-SSIM vs modern SOTA; often due to stylistic shifts or incomplete separation.

Why each piece exists and what breaks without it:

  • Fixed prompts: Without them, we might overfit prompts per image, hiding true zero-shot behavior.
  • Wide dataset coverage: Without it, results could be biased by a few easy tasks.
  • Dual-lane metrics: Without perceptual metrics, we’d miss the model’s beauty strengths; without fidelity metrics, we’d miss its truth gaps.
  • Resizing alignment: Without matching sizes, fidelity metrics would be unfair or meaningless.

Put together, the method is a clean, transparent test bench that surfaces the same story across many tasks: Nano Banana Pro is a great “make it look nice” zero-shot helper, but not a “recreate the exact truth” specialist.

04 Experiments & Results

The test: Measure both how good images look to humans (perception) and how exactly they match the ground truth (fidelity). The question: Can one general model, used with plain prompts and no fine-tuning, be a low-level all-rounder?

The competition: Top specialist models per task—CNNs, Transformers, GANs, diffusion, physics-informed methods. They are usually trained on paired data and often top PSNR/SSIM.

The scoreboard with context:

  • Dehazing (RTTS, Fattal’s): Nano Banana Pro scored top-tier NIMA (aesthetics) but mixed BRISQUE and FADE; visual wins included clear edges under heavy haze; failures included color fidelity (over-blue skies). Translation: It’s like painting a sky that looks “clear,” but not the sky that was really there.

  • Super-Resolution (DIV2K-Val, RealSR, DRealSR): PSNR/SSIM trailed GAN/diffusion specialists (often by >4 dB on DIV2K-Val), but NIQE (naturalness) was consistently best. Visuals showed sharper geometry but also unintended field-of-view expansion, hallucinated textures, and wrong characters when text was too degraded. Translation: The pictures look clean and crisp, but the microscope points out mismatches to the original.

  • Deraining (Rain200L/H, SPA-Data): PSNR/SSIM much lower than SOTA; however, the model sometimes restored globally coherent structures better than some baselines. Failures included confusing rain with haze and altering background tones. Translation: It dries the window well enough for a nice view but occasionally repaints the scene to its taste.

  • Motion Deblurring (GoPro, HIDE, RealBlur-J/R): Much lower PSNR/SSIM vs best methods; perceptually sharper on static text and edges, but identity changes in faces, ghosting on complex motion, and altered lighting textures. Translation: The scene pops, but it might not be the same person or the same neon sign.

  • Defocus Deblurring (DPDD, RealDOF): Large gaps to SOTA on PSNR/SSIM; the model often boosted contrast rather than truly inverting defocus blur. Sometimes added artifacts when trying to sharpen. Translation: It brightens and crisps up the image but doesn’t really “undo the lens.”

  • Denoising (McMaster, Kodak24, Urban100; SIDD, PolyU): Substantial deficits to specialists across both synthetic and real sensor noise; occasional wins in text clarity but frequent color shifts and lost micro-textures. Translation: It cleans noise in a way that pleases the eye but not the measuring tools.

  • Reflection Removal (Real20, Nature, SIR² subsets): Lower than recent SOTA in both fidelity and perceptual metrics (LPIPS, MS-SSIM), indicating not only misalignment but also stylistic drift and incomplete layer separation. Translation: It can reduce reflections, but sometimes merges them into the scene or changes the style.

Surprising findings:

  • NIQE leadership in super-resolution: Despite low PSNR/SSIM, best NIQE implies very natural-looking statistics, possibly from strong priors suppressing artifacts.
  • Systematic hallucinations: Recurrent patterns like over-blue skies, boundary expansions, wrong characters, and identity shifts appeared across tasks, not just in one.
  • Semantic bias in rain vs. haze: The model frequently over-removes haze when told to remove rain, likely due to entangled concepts in its training.

Big picture: Nano Banana Pro often produces images that humans might prefer at a glance but fails under microscopes that check exactness. It’s a strong zero-shot starter, not a replacement for task-specialist models where truth-to-original is critical.

05 Discussion & Limitations

Limitations (be specific):

  • Pixel fidelity gap: The model’s stochastic, generative nature creates plausible but non-identical details, so PSNR/SSIM are consistently lower than specialists.
  • Semantic drift: It may change meaning—wrong text characters, shifted identities, altered lighting or weather (e.g., turning gray skies into blue).
  • Physical inconsistency: It doesn’t invert real camera physics (e.g., defocus PSF, motion kernels), so deblurring and deraining can be stylistic rather than faithful.
  • Resolution and alignment: Output resizing and field-of-view shifts hurt reference metrics and can change composition.
  • Instruction sensitivity: Simple prompts help, but subtle wording can nudge the model toward over-enhancement or over-cleaning.

Required resources:

  • Access to the commercial API and enough compute/bandwidth for many images.
  • Careful data handling to resize outputs and compute metrics fairly.
  • For broader use, storage for multi-dataset evaluations and possibly human assessments.

When NOT to use it:

  • Forensic or medical imaging where pixel-true recovery is critical.
  • Traffic signs, license plates, or documents where exact text must be preserved.
  • Scientific measurements (astronomy/microscopy) where quantitative fidelity matters.
  • Benchmark submissions judged purely by PSNR/SSIM.

Open questions:

  • Can we design new evaluation metrics that combine human preference with a penalty for semantic errors (e.g., wrong text or identity)?
  • What hybrid methods best mix generative priors with physics constraints (e.g., atmospheric scattering, blur kernels) to tame hallucinations?
  • How far can prompt tuning or lightweight adapters go before we lose the zero-shot spirit?
  • Can we guide stochastic sampling to favor fidelity while keeping perceptual quality?
  • How should we detect and prevent semantic shifts (e.g., identity or text) reliably in restoration pipelines?

06 Conclusion & Future Work

Three-sentence summary: This paper tests Nano Banana Pro, a general, commercial text-to-image model, as a zero-shot fixer across 14 low-level vision tasks and 40 datasets. It often makes images look cleaner and more natural to people, but it lags behind specialist models on pixel-accuracy metrics and sometimes changes scene meaning. The study shows we need evaluation and method designs that blend human appeal with truth to the original.

Main achievement: A broad, careful benchmark that reveals a consistent perception-versus-fidelity split for a generalist generative model across many classic tasks without fine-tuning.

Future directions:

  • Build hybrid systems combining generative priors with physics and geometry to reduce hallucinations.
  • Develop perception-aligned metrics that also penalize semantic mistakes.
  • Explore prompt engineering and light adapters to improve fidelity without losing zero-shot versatility.
  • Add guardrails (text/face consistency checks) to prevent semantic drift.

Why remember this: It’s a reality check for the growing belief that one big model can do everything. Generalist generators are strong zero-shot stylists—they can make images look great—but they still need help to be trustworthy restorers when every pixel counts.

Practical Applications

  • Quick photo cleanup for social media: remove haze or noise with a single prompt to get pleasing images fast.
  • Upscaling old family photos for printing where a natural look matters more than exact pixel matching.
  • Enhancing casual action shots (e.g., sports) for visual sharpness while acknowledging possible identity/text drift.
  • Preprocessing webcam or livestream visuals to look cleaner under poor lighting for non-critical use.
  • Rapid prototyping in vision labs: use zero-shot results as a baseline before building task-specific models.
  • Creative content editing: stylized dehazing/deraining that intentionally favors aesthetics.
  • Designing new evaluation dashboards that show both perception (NIMA/NIQE) and fidelity (PSNR/SSIM) scores side by side (a minimal sketch follows this list).
  • Hybrid pipelines: use the generative model for plausible detail plus a physics or specialist module to enforce truth on important regions (faces, text).
  • Prompt engineering libraries: curated prompts that reduce over-enhancement and color drift for each task.
  • Quality control flags: automatic detectors to warn about semantic shifts (e.g., altered text or faces) before deploying results.
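
As a sketch of the dashboard idea in the list above, the snippet below lays fidelity and perception columns side by side with pandas; every number in it is a made-up placeholder, not a figure from the paper.

```python
import pandas as pd

# One row per (task, dataset, model); fidelity and perception columns are meant
# to be read together rather than in isolation. All values are placeholders.
rows = [
    {"task": "deraining", "dataset": "Rain200H", "model": "generalist",
     "psnr_db": 21.0, "ssim": 0.62, "niqe": 3.4, "nima": 5.6},
    {"task": "deraining", "dataset": "Rain200H", "model": "specialist",
     "psnr_db": 33.0, "ssim": 0.93, "niqe": 4.2, "nima": 5.1},
]

dashboard = pd.DataFrame(rows).set_index(["task", "dataset", "model"])
# Fidelity lane: PSNR/SSIM, higher is better.
# Perception lane: NIQE, lower is better; NIMA, higher is better.
print(dashboard)
```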
#low-level vision #zero-shot restoration #generative models #perception-distortion trade-off #PSNR #SSIM #NIQE #NIMA #LPIPS #dehazing #super-resolution #deraining #motion deblurring #reflection removal #evaluation metrics