UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture

Level: Intermediate
Shuo Cao, Jiayang Li, Xiaohui Li et al. · 12/25/2025
arXiv · PDF

Key Summary

  • This paper teaches AI to notice not just what is in a picture, but how the picture looks and feels to people.
  • It introduces UniPercept-Bench, a unified test that checks three things at once: aesthetics (beauty), quality (cleanliness), and structure/texture (patterns and materials).
  • It also builds UniPercept, a model trained in two stages so it can both score images (like giving grades) and answer questions about perceptual details.
  • A clear three-layer system (Domain → Category → Criterion) organizes the tasks so the AI learns exactly what to look for.
  • A special reward trick (Adaptive Gaussian Soft Reward) helps the model learn to predict numbers smoothly instead of guessing wildly.
  • Across many tests, UniPercept beats general-purpose models and even specialized models on key perceptual benchmarks.
  • UniPercept can act like a 'taste tester' reward for image generators, making their pictures prettier, cleaner, and richer in details.
  • Perceptual understanding is different from semantic understanding; this work adds that missing piece so AIs align better with human judgment.
  • The benchmark is smaller than big semantic ones, but it’s precise, consistent, and covers areas (like texture richness) that were missing.
  • This unified approach makes it easier to evaluate, improve, and control how images look in real apps, from photo editing to AI art.

Why This Research Matters

People don’t just care what’s in a photo; they care how it looks and feels. This work gives AI the missing skills to judge beauty, cleanliness, and texture richness in a single, consistent system. That means photo apps can auto-improve images more cleverly, creators can measure and tune style more precisely, and image generators can learn to make pictures people actually prefer. Because the system is unified, we can compare and combine perceptual dimensions instead of guessing one at a time. In short, it brings AI’s eye closer to a human’s eye and makes everyday image tools smarter and more satisfying.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how you can tell a photo is pretty, sharp, and full of neat details even before you name what’s in it? Your eyes judge how it looks and feels, not just what it is.

🥬 The Concept (Perceptual vs. Semantic Understanding): Perceptual understanding is about how an image looks and feels (beauty, cleanliness, texture), while semantic understanding is about what’s in the image (objects, actions, scenes). How it works:

  1. Semantic: find and relate objects (dog, beach, sunset).
  2. Perceptual: judge the look—Is it pleasing? Is it blurry? Are materials and patterns realistic? Why it matters: Without perceptual understanding, an AI can say “a dog on a beach” but still miss that the photo is badly composed, noisy, or has plastic-looking grass. 🍞 Anchor: A postcard of Paris can clearly show the Eiffel Tower (semantics) but still look ugly if it’s tilted, overexposed, and smudgy (perception).

🍞 Hook: Imagine a super student who’s great at naming things in pictures but struggles to grade how good the picture looks. That’s many modern AIs today.

🥬 The Concept (Multimodal Large Language Models—MLLMs): MLLMs are models that read images and text together to answer questions, describe scenes, and reason. How it works:

  1. See: turn image pixels into features.
  2. Read: understand the question or instruction.
  3. Think: combine image and text to respond. Why it matters: They ace semantic tasks (captioning, grounding) but struggle with subtle perceptual judgments like “Is the lighting harmonious?” 🍞 Anchor: An MLLM can say “A boy riding a bike” but might wrongly rate the photo’s sharpness or balance.

🍞 Hook: Imagine judging a drawing contest. You’d look at beauty, neatness, and how detailed the textures are. Three separate but related things.

🥬 The Concept (Three Perceptual Domains—IAA, IQA, ISTA):

  • IAA (Image Aesthetics Assessment) judges beauty and artistic appeal.
  • IQA (Image Quality Assessment) checks technical cleanliness (blur, noise, exposure).
  • ISTA (Image Structure & Texture Assessment) looks at shapes, patterns, materials, and detail richness. How it works:
  1. Each domain focuses on a different question (Is it beautiful? Is it clean? Is it richly detailed?).
  2. Together they cover how images truly look to humans. Why it matters: If you only check one, you miss important parts of how people see images. 🍞 Anchor: A crisp passport photo has high IQA but low IAA; a dreamy art shot may be beautiful (IAA) but simple in texture (low ISTA).

🍞 Hook: People tried lots of partial tests, like only beauty or only quality. But it was like grading math without checking reading.

🥬 The Concept (The Problem and Gap): Previous benchmarks and models focused on semantics or a single perceptual slice. Texture/structure richness wasn’t systematically defined; aesthetics data and quality data lived in separate worlds; many tests turned pictures into words first, skipping true visual perception. How it works:

  1. Scattered tasks = scattered learning.
  2. No shared definition = confusion and inconsistency. Why it matters: Models gave unstable scores and missed fine-grained perceptual cues that humans care about. 🍞 Anchor: Two cameras both say “cat,” but only one photo feels balanced, sharp, and furry-real; old benchmarks couldn’t measure that full difference.

🍞 Hook: Imagine making a report card with three main subjects and clear rubrics for each. Now any judge (human or AI) can grade fairly.

🥬 The Concept (UniPercept-Bench): UniPercept-Bench is a unified benchmark that tests all three perceptual domains (IAA, IQA, ISTA) with a clear three-level taxonomy: Domain → Category → Criterion. How it works:

  1. Define precise criteria (e.g., visual balance, distortion type, 2D contour shape).
  2. Provide two task types: Visual Rating (give a 0–100 score) and VQA (answer targeted questions).
  3. Build datasets with careful generation, filtering, and human refinement. Why it matters: With shared definitions and tasks, models learn and are evaluated consistently across all key perceptual skills. 🍞 Anchor: Questions like “Which area shows the most blur?” or “What material is this?” plus ratings like “Aesthetics: 78/100” let us see exactly what the model gets right.
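
To make the two task formats concrete, here is a minimal sketch of what one Visual Rating item and one VQA item could look like; the field names and example values are illustrative assumptions, not the benchmark's released schema.

```python
# Minimal sketch of hypothetical UniPercept-Bench items (field names and
# values are illustrative assumptions, not the paper's released format).

visual_rating_item = {
    "task": "VR",                        # Visual Rating: output a 0-100 score
    "domain": "IAA",                     # Domain level of the taxonomy
    "category": "Composition & Design",  # Category level
    "criterion": "Visual Balance",       # Criterion level
    "image": "img_0042.jpg",
    "score": 78,                         # human reference score
}

vqa_item = {
    "task": "VQA",                       # targeted perceptual question
    "domain": "IQA",
    "category": "Distortion Type",
    "criterion": "Blur Localization",
    "image": "img_0042.jpg",
    "question": "Which area shows the most blur?",
    "options": ["A. Sky", "B. Riders", "C. Railing", "D. Ground"],
    "answer": "B",
}
```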

🍞 Hook: Training a soccer player only on shooting won’t teach them passing. You need drills that match the real game.

🥬 The Concept (UniPercept Model): UniPercept is a strong baseline model trained for unified perceptual understanding using two stages: Domain-Adaptive Pre-Training and Task-Aligned Reinforcement Learning. How it works:

  1. Pre-train on big, carefully chosen perceptual data to learn the “vocabulary” of aesthetics, quality, and texture.
  2. Fine-tune with rewards that match the tasks (scoring and Q&A) to make outputs stable, correct, and human-aligned. Why it matters: Without both stages, models either lack perceptual knowledge or learn to guess numbers poorly. 🍞 Anchor: After training, UniPercept can both rate “IQA: 83/100” and answer “Which area is most blurry? The roller coaster riders.”

Real stakes:

  • Content creators need pretty and clean images that feel real.
  • Photo apps must auto-fix blur/exposure and judge results.
  • Image generators benefit from a “taste tester” reward to produce better pictures.
  • Datasets can be curated by perceptual quality, not just labels.
  • Users care about how pictures look, not only what they show.

02Core Idea

🍞 Hook: Imagine a mixing board with three sliders: Beauty, Cleanliness, and Detail Richness. If you can see and control all three, your images sound (and look) just right.

🥬 The Concept (Aha! in One Sentence): Put all perceptual skills under one roof—with clear definitions, joint tasks, and a two-stage training recipe—so a single model can judge and reason about how images look to humans. How it works:

  1. Build a unified benchmark (UniPercept-Bench) with Domain → Category → Criterion.
  2. Train UniPercept with domain-adaptive pre-training to learn perceptual features.
  3. Align it with task-specific rewards so it scores steadily and answers correctly. Why it matters: Before, models were piecemeal and wobbly; now they can consistently evaluate aesthetics, quality, and texture together. 🍞 Anchor: Like a school with a shared grading rubric for art (aesthetics), neatness (quality), and craftsmanship details (texture), plus a champion student trained exactly for those rubrics.

Multiple analogies:

  • Report card analogy: IAA = art grade, IQA = neatness/legibility, ISTA = craftsmanship/detail. One report, three grades.
  • Doctor check-up: aesthetics (overall well-being), quality (vital signs), structure/texture (x-ray/skin detail). All needed for a full picture.
  • Music equalizer: three knobs—beauty, clarity, richness—you can measure and tune together.

Before vs. After:

  • Before: Benchmarks separated (or missing ISTA), models guessed numbers unstably, VQA didn’t target perception well.
  • After: A single benchmark covers all perceptual parts, a single model learns them jointly, and results are steady and interpretable.

🍞 Hook: You know how recipes list ingredients and steps so cooks don’t get lost?

🥬 The Concept (Domain–Category–Criterion Taxonomy): A precise three-layer map that turns fuzzy ideas (like “harmony” or “texture”) into checkable pieces. How it works:

  1. Domain: pick IAA, IQA, or ISTA.
  2. Category: zoom into a subtopic (e.g., Visual Elements & Structure; Distortion Type; Geometric Composition).
  3. Criterion: ask a focused question (e.g., 2D contour: Hexagon or Circle?). Why it matters: Without a map, training drifts; with it, models learn exactly what humans mean. 🍞 Anchor: “Which part shows most motion blur?” or “What material is this surface?” are crisp, teachable targets.
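
A tiny, illustrative slice of such a taxonomy as data, assuming a simple nested mapping; the category and criterion names below echo examples from this summary and are not the paper's full list.

```python
# Illustrative slice of the Domain -> Category -> Criterion map.
# Entries are examples only; the real taxonomy is larger.
TAXONOMY = {
    "IAA":  {"Composition & Design": ["Visual Balance", "Geometric Composition"]},
    "IQA":  {"Distortion": ["Distortion Type", "Blur Localization"]},
    "ISTA": {"Structure & Texture": ["2D Contour Shape", "Material"]},
}

def tag_question(domain: str, category: str, criterion: str) -> dict:
    """Attach precise taxonomy tags to a question so training targets stay crisp."""
    assert criterion in TAXONOMY[domain][category], "criterion not in taxonomy"
    return {"domain": domain, "category": category, "criterion": criterion}

tag = tag_question("IAA", "Composition & Design", "Visual Balance")
```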

🍞 Hook: Practicing vocab before an exam makes answers sharper.

🥬 The Concept (Domain-Adaptive Pre-Training): Teach the model a big, diverse ‘perceptual vocabulary’ across IAA, IQA, ISTA before fine-tuning. How it works:

  1. Curate large datasets aligned to each domain.
  2. Include both text-based QAs and rating-linked pairs.
  3. Add structured ISTA annotations so the model learns geometry/material words. Why it matters: Without this base, the model lacks the building blocks to talk about perception. 🍞 Anchor: It learns words like “glossy,” “noise,” “hexagon,” and how they look in real images.

🍞 Hook: Puppies learn fast with the right treats at the right time.

🥬 The Concept (Task-Aligned Reinforcement Learning): Fine-tune with rewards that match each task so the model stops guessing and starts aligning. How it works:

  1. For VQA: binary reward (right/wrong answer).
  2. For VR: Adaptive Gaussian Soft Reward that gives higher reward the closer the predicted score is to the true one—and lowers smoothly as error grows. Why it matters: Hard thresholds make learning jittery; smooth rewards teach steady scoring. 🍞 Anchor: Predicting 83 when the truth is 85 earns a big treat; predicting 40 earns a tiny one.
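
A minimal sketch of the two reward types, assuming the soft reward is a Gaussian of the score error with a fixed width; the paper adapts the width as error grows, which is omitted here. With sigma fixed at 8 the outputs come close to the example numbers quoted later in this summary (error 5 → ~0.83, error 20 → ~0.06); the small differences are due to that adaptive width.

```python
import math

def vqa_reward(predicted: str, answer: str) -> float:
    """Binary reward for VQA: 1 if the chosen option matches, else 0."""
    return 1.0 if predicted.strip() == answer.strip() else 0.0

def gaussian_soft_reward(pred_score: float, true_score: float, sigma: float = 8.0) -> float:
    """Gaussian-shaped reward for Visual Rating: largest when the predicted
    score is close to the ground truth, decaying smoothly as error grows.
    A fixed sigma is an assumption; the paper adapts the variance."""
    error = pred_score - true_score
    return math.exp(-(error ** 2) / (2 * sigma ** 2))

print(round(gaussian_soft_reward(80, 85), 2))  # error 5  -> 0.82
print(round(gaussian_soft_reward(60, 80), 2))  # error 20 -> 0.04
```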

🍞 Hook: Reading numbers from a wiggly dial is hard; a digital display is steadier.

🥬 The Concept (Token As Score + GRPO): Use “Token As Score” to make number prediction stable, and optimize with GRPO (a PPO-style method) to keep learning steady. How it works:

  1. Token As Score: map output tokens to numeric ratings, reducing random number swings.
  2. GRPO: a clipped policy-gradient method that controls updates to avoid overshooting. Why it matters: Without stability tricks, numeric ratings can wobble a lot. 🍞 Anchor: Like reading your temperature from a digital thermometer (Token As Score) and adjusting the heater gently (GRPO) so you don’t overshoot.
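
One plausible reading of Token As Score, offered only as an assumption about how such a mapping can work: decode a rating from the model's probabilities over a few discrete level tokens instead of from free-form digits. The level names and anchor values below are hypothetical.

```python
# Hypothetical token-to-score mapping: read probabilities over level tokens
# and decode an expected 0-100 rating. Names and anchors are assumptions.
LEVELS = {"bad": 10.0, "poor": 30.0, "fair": 50.0, "good": 70.0, "excellent": 90.0}

def token_as_score(level_probs: dict) -> float:
    """Expected score under the model's distribution over level tokens."""
    total = sum(level_probs.values())
    return sum(LEVELS[t] * p for t, p in level_probs.items()) / total

score = token_as_score({"bad": 0.02, "poor": 0.08, "fair": 0.20,
                        "good": 0.45, "excellent": 0.25})
print(round(score, 1))  # 66.6 -> a steadier readout than free-form numbers
```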

Building blocks put together:

  • UniPercept-Bench: the map and the tests.
  • Visual Rating (VR): continuous scores for IAA, IQA, ISTA.
  • Visual Question Answering (VQA): targeted questions to prove understanding.
  • Domain-adaptive pre-training: vocabulary of perception.
  • Task-aligned RL with soft rewards: stable, accurate scoring and answers.
  • Result: a unified, dependable perceptual evaluator and reasoner.

03Methodology

High-level flow: Input image → Perceptual feature understanding (Domain-Adaptive Pre-Training) → Task-specific alignment (Task-Aligned RL with rewards) → Output (VR scores and VQA answers)

Stage 0: Building the Benchmark (so training knows what to learn) 🍞 Hook: Imagine writing a study guide before taking the test. 🥬 The Concept (Benchmark Construction Pipeline): Create clean, focused Q&A and scoring data across IAA, IQA, ISTA using a three-step pipeline. How it works:

  1. Initial QA generation: pair images with expert notes and question templates to draft candidate questions.
  2. Rejection sampling by another model: auto-check question/answer/criterion validity and toss weak ones.
  3. Human refinement: trained volunteers fix tricky cases and remove any bad pairs. Why it matters: Messy questions teach messy thinking; clean data teaches clear skills. 🍞 Anchor: Example: “Which part shows the most motion blur? A. Ball, B. Bench, C. Floor near towel, D. Back wall.”
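
A hedged sketch of that three-step curation loop; the checker model, its prompt, and the human-review hook are placeholders standing in for whatever the authors actually used.

```python
# Schematic of the generate -> auto-filter -> human-refine pipeline.
# checker_model and human_review are placeholder callables, not real APIs.
def generate_candidates(image, expert_notes, templates):
    """Step 1: draft candidate QA pairs from question templates + expert notes."""
    return [t.format(notes=expert_notes) for t in templates]

def passes_auto_check(candidate, checker_model) -> bool:
    """Step 2: rejection sampling - a second model verifies that question,
    answer, and criterion are consistent; weak candidates are discarded."""
    verdict = checker_model(f"Is this QA pair valid and answerable? {candidate}")
    return verdict.strip().lower().startswith("yes")

def curate(images, notes_list, templates, checker_model, human_review):
    kept = []
    for img, notes in zip(images, notes_list):
        for cand in generate_candidates(img, notes, templates):
            if passes_auto_check(cand, checker_model):   # auto filter
                kept.append(human_review(img, cand))     # Step 3: fix or veto
    return [qa for qa in kept if qa is not None]
```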

Stage 1: Domain-Adaptive Pre-Training (learn the perceptual vocabulary) 🍞 Hook: You learn the alphabet before writing poems. 🥬 The Concept (Perceptual Vocabulary Pre-Training): Feed the model large, diverse data across IAA, IQA, ISTA, including structured ISTA annotations. How it works:

  1. Text-based QA pairs build concept-language links (e.g., “overexposure,” “veined,” “paraboloid”).
  2. VR-linked pairs connect images to numeric ratings so numbers have meaning.
  3. Structured ISTA outputs teach shapes, materials, and patterns in a consistent schema. Why it matters: Without shared terms and examples, the model can’t talk precisely about perception. 🍞 Anchor: Seeing many glossy metal objects helps the model recognize “glossy metal” reliably later.
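
Here is what the three sample types might look like on disk, purely as an illustration; the keys are assumptions, and only the roof component mirrors the structured annotation quoted later in this summary.

```python
# Hypothetical shapes of the three pre-training sample types.
text_qa_sample = {
    "image": "kitchen.jpg",
    "question": "Is the highlight area overexposed?",
    "answer": "Yes, the window region is clipped to near-white.",
}

rating_sample = {
    "image": "harbor.jpg",
    "prompt": "Rate the overall image quality from 0 to 100.",
    "target_score": 83,
}

ista_structured_sample = {
    "image": "house.jpg",
    "components": [
        {"name": "Roof", "material": "Tile", "surface": "Matte",
         "contour_2d": "Rectangle", "volume_3d": "Prism"},
        {"name": "Window", "material": "Glass", "surface": "Glossy",  # illustrative
         "contour_2d": "Rectangle", "volume_3d": "Cuboid"},
    ],
}
```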

Stage 2: Task-Aligned Reinforcement Learning (make performance match the tasks) 🍞 Hook: Practice tests with instant feedback help you improve fastest. 🥬 The Concept (Two Reward Types):

  • VQA: binary reward (1 if correct, 0 if not)
  • VR: Adaptive Gaussian Soft Reward (biggest when the predicted score is close to the ground truth; smoothly smaller as error grows) How it works:
  1. Sample multiple candidate answers/scores per question (n responses).
  2. Compute rewards per candidate.
  3. Use GRPO (a PPO-style method) to gently push the policy toward better candidates without over-correcting. Why it matters: Smooth numeric rewards prevent jumpy learning; binary rewards keep answers exact. 🍞 Anchor: Guessing 78 for a true 80 earns more reward than guessing 50; correct multiple choice earns a 1.
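
A schematic of the GRPO-style update in NumPy, under the usual group-relative formulation (normalize each candidate's reward against its own group, then take a clipped policy-gradient step); this sketches the general recipe, not the paper's training code.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Normalize each candidate's reward against its own group (no value critic)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate: push gently toward better candidates,
    with the probability ratio clipped so one update cannot overshoot."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return np.minimum(unclipped, clipped).mean()  # quantity to maximize

# Example: 4 sampled ratings for one image, scored by the Gaussian soft reward.
rewards = [0.82, 0.40, 0.95, 0.10]
adv = group_relative_advantages(rewards)
loss = -clipped_objective([-1.1, -2.0, -0.9, -2.5], [-1.2, -1.9, -1.0, -2.4], adv)
```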

Key steps (what/why/example):

  • Step A: Taxonomy labeling (Domain → Category → Criterion) What: Attach precise tags to each question. Why: Prevents confusion (e.g., “rhythm” vs. “balance”). Example: IAA → Composition & Design → Visual Balance.

  • Step B: Visual Rating formatting via Token As Score What: Convert numeric targets into token outputs to stabilize text-to-number prediction. Why: Direct numbers are noisy; tokens are steadier. Example: Model emits the token sequence that decodes to 83/100, not a free-form sentence.

  • Step C: ISTA structured annotations What: Decompose scenes into components with base morphology, materials, contours, and volume terms. Why: Teaches the model concrete texture/structure language grounded in visuals. Example: “Component: Roof → Material: Tile, Surface: Matte, Contour: Rectangle, Volume: Prism.”

  • Step D: Reward shaping with Adaptive Gaussian Soft Reward What: Reward falls off smoothly as numeric error grows; variance adjusts with error. Why: Encourages close-but-not-perfect guesses and reduces brittleness. Example: Error 5 → reward ~0.83; error 20 → ~0.06.

  • Step E: Multi-task training (VR + VQA together) What: Train both scoring and Q&A in the same loop. Why: They reinforce each other; understanding helps scoring and vice versa. Example: Learning “where is blur?” improves both the answer and the quality score.

  • Step F: Multi-domain mixing (IAA + IQA + ISTA) What: Train all three domains together. Why: Aesthetics, quality, and texture share visual cues; mixing boosts generalization. Example: Learning “good exposure” (IQA) aids “pleasing light modeling” (IAA).

The secret sauce:

  • A unified, human-aligned taxonomy that turns fuzzy perceptual ideas into teachable targets.
  • Smooth numeric rewards (Adaptive Gaussian Soft Reward) that make scoring stable.
  • Multi-task, multi-domain training that creates a shared perceptual backbone.
  • Token As Score + GRPO to keep numbers steady and learning controlled.

End-to-end example: Input: A photo of a cat on a moving roller coaster, slightly overexposed.

  • VQA: “Which part shows most motion blur?” → “The people seated on the roller coaster.”
  • VR: IAA 59/100 (central cat helps balance but harsh glare), IQA 83/100 (minor distortions), ISTA 64/100 (moderate texture richness).

04Experiments & Results

The test: The team evaluated two abilities—

  • Visual Rating (VR): give 0–100 scores that match human ratings (measured by SRCC/PLCC, like “how well do rankings and values agree with ground truth”).
  • Visual Question Answering (VQA): answer multiple-choice or yes/no perceptual questions (measured by accuracy).
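
For reference, SRCC and PLCC can be computed directly with SciPy; the scores below are made up just to show the call.

```python
# SRCC/PLCC agreement between model scores and human scores (toy numbers).
from scipy.stats import spearmanr, pearsonr

human_scores = [78, 55, 90, 62, 40]
model_scores = [74, 60, 88, 58, 45]

srcc, _ = spearmanr(human_scores, model_scores)  # rank-order agreement
plcc, _ = pearsonr(human_scores, model_scores)   # linear/value agreement
print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")       # 1.0 means perfect agreement
```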

The competition: UniPercept was compared to

  • General MLLMs: GPT-4o, LLaVA-OneVision, InternVL3 series, QwenVL series, GLM-4.5-V, etc.
  • Specialized scorers: ArtiMuse (aesthetics), DeQA and Q-Insight (quality), others.

The scoreboard (contextualized highlights):

  • VR (ratings): UniPercept delivered the strongest correlations across Aesthetics (IAA), Quality (IQA), and Structure/Texture (ISTA). For example, on IQA and ISTA its average correlations are like consistently getting A-level agreement, while many general models sit closer to B or C levels. On IAA, which is harder and more subjective, UniPercept still tops general-purpose models and challenges dedicated systems.
  • VQA (perceptual Q&A):
    • IAA VQA: ~76.6% overall for UniPercept, clearly above strong baselines (many sit in the 60–70% range).
    • IQA VQA: ~81.1% overall, where UniPercept often exceeds general models by a sizable margin.
    • ISTA VQA: ~84.2%, showing especially strong understanding of geometry/material/structure.
    In plain terms: if 70% is like a solid B, UniPercept regularly lands in the B+/A− zone across domains, while many others hover around B− or lower, especially on the hard, fine-grained questions.

Surprising findings:

  • ISTA seems slightly easier for general models than IAA: physical/geometry/material cues are more objective and align with pretraining priors. Still, UniPercept pushes ISTA even higher by using structured annotations.
  • “Yes/No” and “Why” questions are easier; “What/Which” (requiring precise local analysis) and “Level Prediction” (fine-grained numeric reasoning) are tougher. UniPercept narrows this gap.
  • Numeric ratings are hard for general MLLMs (they can be wobbly). Token As Score plus the soft reward makes UniPercept’s ratings steady.
  • Cross-domain synergy is real: training on IAA, IQA, and ISTA together improves overall performance—even helping in each single domain.

As a reward model for image generation:

  • Using UniPercept’s IAA reward makes pictures prettier (aesthetics metrics rise).
  • Using IQA reward makes pictures cleaner and sharper.
  • Using ISTA reward increases structural/textural richness.
  • Combining all three rewards gives the best overall gains—like turning all three equalizer knobs to the sweet spot.
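
A minimal sketch of how the three scores could be folded into one generator reward, assuming equal weights and a 0–100 scale; the actual weighting used in the paper may differ.

```python
# Weighted combination of IAA/IQA/ISTA scores into a single [0, 1] reward.
# Weights and normalization are assumptions for illustration.
def combined_perceptual_reward(scores: dict, weights=(1.0, 1.0, 1.0)) -> float:
    """scores: {'IAA': 0-100, 'IQA': 0-100, 'ISTA': 0-100} from the evaluator."""
    w_iaa, w_iqa, w_ista = weights
    total = w_iaa * scores["IAA"] + w_iqa * scores["IQA"] + w_ista * scores["ISTA"]
    return total / (100.0 * (w_iaa + w_iqa + w_ista))

# The end-to-end example above (IAA 59, IQA 83, ISTA 64) would earn:
print(round(combined_perceptual_reward({"IAA": 59, "IQA": 83, "ISTA": 64}), 2))  # 0.69
```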

Big picture: Across many datasets and comparisons, UniPercept shows that unifying definitions, tasks, and training actually pays off—with stronger, more human-aligned perceptual understanding than previous general or specialized systems.

05Discussion & Limitations

Limitations:

  • Scale: While broad and carefully defined, UniPercept-Bench is smaller than giant semantic datasets. Even more diverse images and annotators would further strengthen generalization.
  • Subjectivity: Aesthetics (IAA) can depend on culture and taste; any single score is an approximation of many human opinions.
  • Coverage: Real-world distortions and exotic materials/styles can be endless; some rare cases may slip outside today’s taxonomy.
  • Numeric fragility: Even with Token As Score, numeric outputs can still drift without careful prompts and training.

Required resources:

  • Compute: Multi-stage training (pre-training + RL) benefits from multiple high-end GPUs.
  • Data: Access to curated IAA/IQA datasets and structured ISTA annotations.
  • Evaluation: Consistent prompts and controlled inference settings to keep numeric outputs stable.

When not to use:

  • High-stakes scientific/medical images where domain-specific metrics and experts are required.
  • Extremely abstract art or conceptual images where aesthetic judgment is intentionally non-standard.
  • Images with distortions or styles not represented (e.g., rare sensor artifacts), where the taxonomy may misclassify.

Open questions:

  • Personalization: Can we adapt aesthetics to individual or cultural tastes while keeping objective quality and structure stable?
  • Explanations: How to generate short, trustworthy, and actionable rationales linked to each sub-score?
  • Scaling ISTA: What’s the best way to expand structured texture/material annotations across many domains?
  • Robustness: How to handle adversarial or synthetic artifacts that mimic or hide real distortions?
  • Control: How to translate perceptual scores into precise editing knobs (e.g., “increase visual balance by X”) in real time?

06Conclusion & Future Work

Three-sentence summary: This paper unifies how AI judges images at a perceptual level—beauty (IAA), cleanliness (IQA), and structure/texture (ISTA)—with a clear benchmark and a two-stage trained model called UniPercept. By combining a precise taxonomy, joint tasks (VR and VQA), and smooth reward learning, UniPercept delivers steadier ratings and smarter answers than previous models. It also serves as a plug-in reward to improve image generators, making pictures look better in ways people actually care about.

Main achievement: Turning scattered, fuzzy perceptual judgments into a single, well-defined, trainable, and testable system—then proving it works better in practice across multiple domains and tasks.

Future directions:

  • Scale up data and annotators to capture even richer styles, materials, and cultural views of aesthetics.
  • Add personalized preference modeling and stronger, more interpretable explanations.
  • Tighten the link from perceptual feedback to controllable image editing and generation.

Why remember this: It closes a big gap between “what’s in the picture” and “how the picture feels,” giving AI a more human-like eye—and giving creators better tools to measure, improve, and steer the look of their images.

Practical Applications

  • Improve camera and photo-editing apps with smarter auto-enhance guided by IAA, IQA, and ISTA scores.
  • Use UniPercept as a reward to fine-tune image generators toward prettier, cleaner, and more detailed results.
  • Filter and curate large image datasets by perceptual quality to train better vision and multimodal models.
  • Provide feedback to artists and designers with targeted suggestions (e.g., improve balance, reduce overexposure).
  • Monitor quality in content platforms by flagging noisy/blurred uploads or low-aesthetic thumbnails.
  • Assist restoration tools by locating distortions (where, how severe, what type) for focused fixes.
  • Enable controllable generation: dial up texture richness without harming exposure, or balance composition without losing detail.
  • Quality gates in printing or e-commerce pipelines to ensure images meet visual standards before publishing.
  • Educational tools that teach photography/composition with immediate, criterion-level feedback.
  • Benchmark new multimodal models on perceptual skills, not just semantics, to guide R&D.
Tags: perceptual image understanding · image aesthetics assessment (IAA) · image quality assessment (IQA) · image structure and texture (ISTA) · visual rating (VR) · visual question answering (VQA) · domain-adaptive pre-training · reinforcement learning (RL) · adaptive Gaussian soft reward · Token As Score · GRPO · benchmark taxonomy · multimodal large language models · reward modeling for text-to-image · human-aligned evaluation