Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
Key Summary
- The paper builds YearGuessr, a large, worldwide photo-and-text dataset of 55,546 buildings with their construction years (1001–2024), GPS coordinates, and popularity (Wikipedia page views).
- It shows that many vision-language models (VLMs) do much better on famous buildings than on ordinary ones, meaning they often memorize landmarks instead of truly understanding architecture.
- The task is treated as ordinal regression (predicting a number with meaningful order), which fits years better than plain classification and improves precision.
- A new benchmark measures error in years (MAE), accuracy within time windows (like ±5 or ±100 years), and a popularity-aware score to expose memorization bias.
- YearCLIP, a CLIP-based method with GPS fusion and simple "reasoning prompts" (like roof or wall type), provides human-checkable explanations and reduces error versus strong baselines.
- Closed-source VLMs get the lowest errors overall but show the strongest popularity bias, performing up to 34 percentage points better on highly viewed landmarks.
- Models are best on modern buildings and worst on very old ones, and they perform unevenly across continents due to data and pretraining imbalances.
- Renovated or rebuilt structures confuse most models, because the visible style can mismatch the original construction year.
- The work offers an open, reproducible benchmark to help researchers build models that understand architecture rather than just memorizing it.
Why This Research Matters
Most buildings on Earth are not world-famous, but they still need care, funding, and safety checks. If our models are mostly good at famous landmarks, we risk making poor decisions for the ordinary buildings we live and learn in. YearGuessr exposes this gap and offers tools to fix it by treating time as an ordered value, adding GPS hints, and checking for popularity bias. With explainable predictions, city planners and historians can trust and verify how a model reached its answer. Over time, this can improve heritage preservation, safer retrofits, and fairer AI that doesn't just memorize the internet's favorites.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're playing a guessing game with your family photos. Some are famous holiday spots you've seen on postcards, and others are your grandparents' house. It's easy to guess the famous ones because you've seen them before, but the everyday homes are trickier.
The Concept (Machine Learning): Machine learning is when a computer learns patterns from examples so it can make good guesses later. How it works:
- Show it lots of labeled examples (like building photos with their years).
- Let it spot patterns that connect looks (images) and facts (years).
- Test it on new photos to see if it learned or just memorized. Why it matters: If the model only memorizes, it stumbles on photos it hasn't seen before. Anchor: If a model has only seen the Eiffel Tower, it might ace that but freeze on a new church in a small town.
Hook: You know how you can rank your favorite songs from 1 to 10? That order means something.
The Concept (Regression vs. Ordinal Regression): Regression predicts a number (like a building's year). Ordinal regression respects the idea that some numbers are closer than others and uses ordered steps to predict them. How it works:
- Treat the answer (a year) as an ordered value.
- Learn to say earlier vs. later with many small comparisons.
- Combine these to get a final year guess. Why it matters: Without ordinality, a mistake of 2 years and 200 years look the same in a simple "wrong/right" world, which isn't fair. Anchor: Guessing 1902 instead of 1905 is a small miss; calling it 1500 is a big miss; ordinal regression knows the difference.
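The "many small comparisons" idea can be sketched in code. This is a common ordinal-encoding trick (cumulative "is the year after threshold t?" targets), not the paper's exact formulation, and the century thresholds are illustrative:

```python
# Represent a year as ordered yes/no answers to "is it after threshold t?".
# Nearby years then share almost all targets, so a 3-year miss costs far
# less than a 400-year miss. Thresholds here are illustrative.
THRESHOLDS = list(range(1100, 2001, 100))  # 1100, 1200, ..., 2000

def year_to_ordinal_targets(year):
    """1 for every threshold the year has passed, 0 otherwise."""
    return [1 if year > t else 0 for t in THRESHOLDS]

def ordinal_targets_to_year(targets):
    """Decode by counting passed thresholds and taking the bin midpoint."""
    k = sum(targets)
    lo = THRESHOLDS[k - 1] if k > 0 else 1001
    hi = THRESHOLDS[k] if k < len(THRESHOLDS) else 2024
    return (lo + hi) // 2

a, b, c = map(year_to_ordinal_targets, (1902, 1905, 1500))
# 1902 vs. 1905 agree on every threshold; 1902 vs. 1500 disagree on five.
```

With this encoding, the training signal automatically reflects how far apart two years sit on the timeline.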
Hook: Think of a camera and a book club working together.
The Concept (Computer Vision + Natural Language Processing): Computer vision understands pictures; natural language processing understands text. Together, they can connect what they see (arches, domes) with what they read (descriptions). How it works:
- Vision reads the image; language reads captions or prompts.
- They meet in a shared space where meanings match.
- The model uses both to decide. Why it matters: Photos alone may miss context; words alone can be vague. Together, they're stronger. Anchor: A facade photo plus "Gothic church in France" is way more helpful than either alone.
Hook: A treasure map needs coordinates to find X.
The Concept (Geospatial Analysis): Geospatial analysis uses GPS (latitude/longitude) to add place-based hints. How it works:
- Encode GPS into numbers the model understands.
- Mix these with image clues.
- Use regional styles as hints (e.g., domes common in region X). Why it matters: Some styles cluster in places; without location, the model may confuse look-alikes from different regions. Anchor: A white dome in Turkey vs. in the U.S. suggests different likely time periods.
Hook: When you grade a test, a 2-point mistake is not the same as a 50-point mistake.
The Concept (Statistical Evaluation Metrics): Metrics are score-keepers that tell us how well the model predicts. How it works:
- Mean Absolute Error (MAE) tells the average "how many years off."
- Interval Accuracy (IA) says if the guess is within certain windows (±5, ±100 years).
- Popularity-aware scores compare results for famous vs. ordinary buildings. Why it matters: Without good scoring, we don't know if models truly learned or just guessed. Anchor: Saying "MAE = 40" means the model is off by 40 years on average, like a report card for time guessing.
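The first two metrics above are simple enough to sketch directly (function names are mine, not the paper's):

```python
def mae(preds, years):
    """Mean Absolute Error: average |prediction - truth| in years."""
    return sum(abs(p - y) for p, y in zip(preds, years)) / len(years)

def interval_accuracy(preds, years, window):
    """Fraction of guesses within +/- `window` years of the truth."""
    hits = sum(abs(p - y) <= window for p, y in zip(preds, years))
    return hits / len(years)

preds = [1902, 1500, 1875, 2010]
years = [1905, 1681, 1880, 2008]
score_mae = mae(preds, years)                   # (3 + 181 + 5 + 2) / 4 = 47.75
score_ia5 = interval_accuracy(preds, years, 5)  # 3 of 4 inside +/-5 -> 0.75
```

Note how the one big miss (1500 vs. 1681) dominates MAE but only costs a single hit under interval accuracy, which is why the benchmark reports both.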
Hook: In the cafeteria, everyone lines up for the most popular pizza slice, and the other dishes get ignored.
The Concept (Popularity Bias): Popularity bias is when models do better on famous things they've seen a lot online, and worse on less-known things. How it works:
- Training data on the internet contains many famous landmarks.
- The model memorizes these and becomes very confident about them.
- On everyday buildings, it has fewer examples and struggles. Why it matters: It looks smart on popular items but underperforms where help is most needed. Anchor: The model may nail the Taj Mahal's year but miss a small-town library.
The world before this paper: Many models could label building style or location roughly, and some could guess building ages, but datasets were small or limited to a region or to modern times. Worse, evaluations often acted like multiple-choice tests, ignoring the ordered nature of time. And no one had a large, open dataset connecting images, years, GPS, and a popularity measure to check if models were really understanding or just memorizing.
The problem: Do vision-language models truly understand architectural cues, or are they just recalling famous landmarks from pretraining? And how should we fairly evaluate year prediction, where time is continuous, ordered, and errors aren't all equal?
Failed attempts: Prior datasets were geographically narrow (like mostly Western Europe), only covered modern years, lacked images, or were closed. Many treated the task as classification ("pick a bucket"), losing the ordered structure of the timeline. This made models look better than they really were and hid memorization.
The gap: We needed a worldwide, open, image-based benchmark with continuous years, GPS, rich text, and a popularity signal to expose memorization, plus fair, ordinal-aware metrics.
Real stakes: City planners, historians, insurers, and disaster teams need reliable age estimates. If models only shine on famous places, they fail where we need them most: the ordinary buildings that make up our neighborhoods, schools, and homes.
02 Core Idea
Hook: Think of guessing someone's age from a photo. It helps to know both how they look and where they grew up, because styles change by place and time.
The Concept (Vision-Language Models, VLMs): VLMs read pictures and words together to answer questions. How it works:
- Turn images and text into vectors (numbers).
- Compare and combine them to understand meaning.
- Use the fused understanding to predict. Why it matters: Many real problems need both seeing and reading, like understanding buildings and their history. Anchor: A VLM can look at a church photo and also read "Baroque facade in Italy," then guess a time period.
Aha! moment in one sentence: Use a big, open, worldwide dataset with years (as ordered numbers), GPS, and popularity to show that top models often memorize famous buildings instead of truly understanding architecture, and then provide a simple, explainable CLIP-based method (YearCLIP) that performs ordinal regression and gives human-checkable reasons.
Three analogies:
- Flashcards vs. Understanding: If you only memorize answers to famous questions, you ace those but freeze on new ones; true learning means you can handle fresh, ordinary problems.
- Thermometer vs. Color Labels: Time is like temperature; predicting exact degrees (years) makes more sense than sorting into color bins (centuries).
- Map + Photo: A travel guide (GPS) plus a photo helps you date a building better than either alone.
Before vs. After:
- Before: Small or narrow datasets, classification buckets, no popularity signal, and unclear whether models were truly reasoning.
- After: YearGuessr (55k buildings, 157 countries, 1001–2024) with ordinality, GPS, captions, and page views exposes popularity bias; YearCLIP shows how to blend vision, GPS, and simple reasoning prompts for explainable predictions.
Hook: When you arrange books by year of publication, nearby books are closely related.
The Concept (Ordinal Regression): Ordinal regression predicts values that have a meaningful order, like years. How it works:
- Split time into coarse periods (e.g., Gothic to Modern).
- Use text prompts to anchor each period.
- Refine to a fine-grained year by combining similarities and probabilities. Why it matters: It rewards being close on the timeline, not just exact matches, making learning smoother. Anchor: Being within ±5 years is a small miss; the model gets credit for being in the right era.
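The "combine similarities and probabilities" step can be sketched like this; the period names, midpoints, and plain softmax are my simplifications, not the paper's exact recipe:

```python
import math

# Coarse-to-fine sketch: softmax over per-period similarity scores, then a
# probability-weighted year. Period midpoints are illustrative.
PERIOD_MIDPOINTS = {"Gothic": 1275, "Renaissance": 1500, "Baroque": 1675,
                    "Neoclassical": 1815, "Modern": 1935, "Contemporary": 1990}

def similarities_to_year(sims):
    """sims: {period: image-text similarity score} -> fine-grained year."""
    exps = {k: math.exp(v) for k, v in sims.items()}
    z = sum(exps.values())
    return sum((exps[k] / z) * PERIOD_MIDPOINTS[k] for k in exps)

year = similarities_to_year({"Gothic": 0.1, "Renaissance": 0.2, "Baroque": 2.5,
                             "Neoclassical": 1.8, "Modern": 0.3, "Contemporary": 0.1})
# Probability mass clusters on Baroque/Neoclassical, so the estimate lands
# in the early 1700s rather than snapping to a single period's label.
```

Because the output is a weighted blend, being confidently wrong between two adjacent periods still produces a year near the truth, which is exactly the "credit for the right era" behavior described above.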
Hook: You know how you might say, "It has a dome and stone walls," to explain your guess?
The Concept (Reasoning Prompts): Reasoning prompts are short, pre-written clues about features (roof, wall, height) that the model can match against the photo. How it works:
- Create a menu of possible features (e.g., dome, spire).
- Encode them with a text encoder (like CLIP).
- Let the model say which features fit best and why. Why it matters: Explanations help humans verify and catch mistakes. Anchor: "I predict 1700 because I see a dome, stone walls, and Baroque-like symmetry."
Hook: GPS is like saying, "We're in Italy," which changes what styles and years are likely.
The Concept (Geographic Encoding): GPS coordinates are turned into helpful numeric hints and fused with the image features. How it works:
- Encode latitude/longitude.
- Learn a fusion (via zero-convolution) so the network decides how much to trust location.
- Use both when available; fall back to image-only when not. Why it matters: Place helps disambiguate look-alike styles that spread globally. Anchor: A dome in Rome and a dome in Nevada have different likely ages.
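A toy version of the zero-convolution idea, assuming only the key property described above (weights start at zero, so an untrained model ignores location and training decides how much to trust it); the class and sizes are illustrative:

```python
# Toy zero-convolution fusion: the GPS branch enters through a linear map
# whose weights are initialized to zero, so the initial output is purely
# image-based and the network can gradually learn to use location.
class ZeroConvFusion:
    def __init__(self, dim):
        self.w = [[0.0] * dim for _ in range(dim)]  # zero-initialized weights

    def __call__(self, image_feat, loc_feat):
        gps_term = [sum(self.w[i][j] * loc_feat[j] for j in range(len(loc_feat)))
                    for i in range(len(image_feat))]
        return [x + g for x, g in zip(image_feat, gps_term)]

fuse = ZeroConvFusion(4)
img = [0.5, -0.2, 0.8, 0.1]
gps = [1.0, 0.3, -0.7, 0.2]
assert fuse(img, gps) == img  # at initialization, output is image-only
```

This also gives the image-only fallback for free: a missing GPS vector can simply be zeros, contributing nothing through the fusion.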
Why it works (intuition without equations):
- CLIP's vision-text space already knows a lot about visual-language matches; adding ordered time (ordinal regression) nudges it to respect the timeline.
- GPS gives priors: some styles and eras cluster by region.
- Reasoning prompts act like labeled arrows pointing to key visual cues (roof, wall, symmetry), strengthening the model's attention to what matters and producing checkable rationales.
Building blocks:
- YearGuessr: 55,546 images, 157 countries, 1001–2024, GPS, captions, page views.
- Metrics: MAE (how many years off), IA (inside ±5/±20/±50/±100), and popularity-aware IA (separate famous vs. ordinary).
- YearCLIP: CLIP backbone + ordinal training (coarse-to-fine) + GPS fusion via zero-convolution + a bank of reasoning prompts for explainability.
03 Methodology
At a high level: Input (building photo + optional GPS) → Image encoder (CLIP) → Location encoder (GPS to features) + Zero-conv fusion → Text encoders for style classes and reasoning prompts → Regressor for coarse-to-fine ordinal year → Output (predicted year + rationale).
- Data creation (YearGuessr)
- What happens: Crawl Wikipedia/Wikimedia categories of buildings (1001–2024); grab images, years, GPS, captions, and page views (popularity). Deduplicate, filter non-facades via CLIP similarity to "a building facade," and lightly audit the test set. Stratify splits by decade and continent.
- Why it exists: We need a global, open, continuous-years dataset with GPS and popularity to check for memorization.
- Example: St. Anthony's Shrine in Sri Lanka with coordinates (6.9469°N, 79.8561°E), a year in the 1800s, and page views as popularity.
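The facade filter can be sketched with precomputed embeddings; the toy vectors stand in for real CLIP outputs, and the 0.25 cosine threshold is my assumption, not a value from the paper:

```python
import math

# Keep images whose embedding is close enough to the text embedding of
# "a building facade". Vectors and threshold are illustrative stand-ins.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_facades(image_embs, facade_text_emb, threshold=0.25):
    """Return indices of images passing the facade-similarity check."""
    return [i for i, emb in enumerate(image_embs)
            if cosine(emb, facade_text_emb) >= threshold]

facade = [1.0, 0.0, 0.0]        # stand-in for the text embedding
imgs = [[0.9, 0.1, 0.0],        # facade-like: kept
        [0.0, 1.0, 0.0]]        # unrelated (e.g., interior): dropped
kept = filter_facades(imgs, facade)
```

In practice the embeddings would come from a CLIP image/text encoder; the filter itself is just this thresholded cosine check.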
- Problem setup: ordinal regression
- What happens: Treat the construction year as an ordered value. Create coarse historical periods (e.g., Roman 800–1150, Gothic 1150–1400, …, Contemporary 1950–present) and learn to map images to these, then refine to a specific year.
- Why it exists: Classification ignores closeness in time; naïve regression wastes the structure of periods/styles.
- Example: A guess of 1708 against a true year of 1681 is rewarded as close; guessing 1500 is penalized far more.
- Metrics and protocol
- What happens:
- MAE: average years off.
- Interval Accuracy (IA): did the guess land inside ±5, ±20, ±50, ±100 years?
- Popularity-aware IA: split test results by page-view bins (e.g., <10k views vs. >10k), report gains or drops.
- Why it exists: To fairly measure fine-grained dating and to expose memorization on famous buildings.
- Example: A model that jumps from 24% to 58% IA±5 on very popular buildings likely memorized them.
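The popularity split can be sketched as below; the 10k-view cutoff follows the example above, while the function name and toy numbers are illustrative:

```python
# Popularity-aware protocol sketch: interval accuracy computed separately
# for low- and high-view bins; a large gap hints at memorization.
def popularity_split_ia(preds, years, views, window=5, cutoff=10_000):
    def ia(rows):
        hits = [abs(p - y) <= window for p, y, _ in rows]
        return sum(hits) / len(hits) if hits else 0.0
    rows = list(zip(preds, years, views))
    low = ia([r for r in rows if r[2] < cutoff])
    high = ia([r for r in rows if r[2] >= cutoff])
    return low, high, high - low

low, high, gap = popularity_split_ia(
    preds=[1905, 1700, 1850, 1999],
    years=[1903, 1500, 1852, 1998],
    views=[500, 2_000, 50_000, 90_000])
# low-view IA(+/-5) = 0.5, high-view = 1.0: a +0.5 gap favoring famous sites
```

A model with genuine architectural understanding should show roughly the same accuracy in both bins; a large positive gap is the memorization signature the benchmark is designed to catch.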
- YearCLIP model
- Image encoder (CLIP):
- What: Turn the 224×224 facade into a meaningful vector.
- Why: CLIP aligns images and text, helpful for style cues.
- Example: Windows with arches, stone texture, and symmetry shape the vector.
- Location encoder + zero-conv fusion:
- What: Convert GPS (lat/lon) with a location encoder, then blend with image features using a learned zero-convolution layer.
- Why: Manually tuning fusion weights is hard; zero-conv lets the model learn how much location matters.
- Example: For Europe, Baroque likelihood rises; for Australia, certain modern periods may be more probable.
- Class encoder (coarse styles):
- What: Encode seven broad style/period prompts as text vectors (Roman…Contemporary).
- Why: Anchors the coarse stage in language space.
- Example: The image vector compares against "Gothic" vs. "Baroque" vectors to get similarity scores.
- Reason encoder (explainable features):
- What: Encode a bank of short prompts, like Roof: dome/spire/mansard; Wall: stone/brick; Height: low/mid.
- Why: Help the regressor focus on visible cues and produce human-checkable reasons.
- Example: High similarity to "Roof: dome" and "Wall: stone" boosts certain periods.
- Regressor (coarse-to-fine ordinal year):
- What: Combine style and reason similarities; compute probabilities over periods; output a year as a weighted mix adjusted by confidence.
- Why: Enforces time order and uses visual-language cues together.
- Example: If probabilities cluster around Baroque and early Neoclassical, the final year lands around late 1600s/early 1700s with a rationale.
- Training recipe
- What happens: Fine-tune CLIP features with an ordinal, ranking-style loss that penalizes out-of-order mistakes more strongly the further they are. Negative examples get higher weights when their years are far away.
- Why it exists: Teaches the model that 1700 is closer to 1680 than 1400, guiding better gradients.
- Example: If the true year is 1868, confusing it with 1880 is a lighter penalty than mixing it with 1500.
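In spirit, the training signal looks like this distance-weighted penalty; the paper's exact ranking formulation differs, and all names and numbers here are illustrative:

```python
# Penalize score mass on wrong periods in proportion to how far their
# years are from the truth: far-away confusions hurt more than near ones.
def distance_weighted_loss(scores, true_year, period_years):
    """scores: one score per candidate period; period_years: midpoints."""
    true_idx = min(range(len(period_years)),
                   key=lambda i: abs(period_years[i] - true_year))
    loss = 0.0
    for i, (s, y) in enumerate(zip(scores, period_years)):
        if i != true_idx:
            loss += max(0.0, s) * abs(y - true_year)
    return loss

period_years = [1500, 1700, 1880]  # illustrative period midpoints
near_miss = distance_weighted_loss([0.0, 0.3, 1.0], 1868, period_years)
far_miss = distance_weighted_loss([0.3, 0.0, 1.0], 1868, period_years)
# Leaking score onto 1500 costs more than leaking the same score onto 1700.
```

The gradient from such a loss pushes hardest against the most anachronistic confusions, which is the "better gradients" effect described above.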
- Explainability pipeline
- What happens: Compute importance scores for reasoning prompts by mixing similarity and regressor attention. Keep top reasons and map them to a readable explanation.
- Why it exists: Trust and debugging; humans can see why the model picked a year.
- Example: "Predicted 1880 because: roof spire, ornate ornamentation, curtain windows, stone wall."
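That last step might be sketched as follows; the importance score (similarity times attention) mirrors the description above, while the function name, prompt strings, and numbers are made up:

```python
# Rank reasoning prompts by importance (similarity * attention, both
# assumed precomputed) and render the top ones as a readable rationale.
def explain(pred_year, prompts, sims, attn, top_k=3):
    scores = [s * a for s, a in zip(sims, attn)]
    top = sorted(range(len(prompts)), key=lambda i: scores[i], reverse=True)[:top_k]
    reasons = ", ".join(prompts[i] for i in top)
    return f"Predicted {pred_year} because: {reasons}."

msg = explain(1880,
              prompts=["roof: spire", "wall: stone", "ornate ornamentation",
                       "roof: flat", "wall: glass"],
              sims=[0.9, 0.7, 0.8, 0.1, 0.05],
              attn=[0.8, 0.9, 0.7, 0.2, 0.1])
# Low-scoring prompts like "wall: glass" never reach the explanation.
```

Because the rationale is just the top-scoring prompts, a human can sanity-check it against the photo, which is the point of the pipeline.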
What breaks without each step:
- No ordinal regression: The model treats 2-year and 200-year mistakes similarly; learning becomes unstable.
- No GPS fusion: Look-alike styles across continents cause more confusion.
- No reasoning prompts: We lose transparency and some fine cues that help dating.
- No popularity-aware metric: We might wrongly celebrate models that are just great at recalling landmarks.
Secret sauce:
- Marrying CLIP's strong vision-text space with ordinal training preserves the timeline.
- Zero-conv fusion lets the model learn how much to trust location.
- A bank of simple reasoning prompts guides attention to age-relevant details and yields checkable rationales.
04 Experiments & Results
The test: Predict each buildingās construction year on the fixed test split (11,087 images) and score using MAE (years off) and IA (within ±5, ±20, ±50, ±100 years). Also, split results by popularity (page views) to see if models do better on famous buildings.
The competition: Over 30 models, including CNNs (ResNet-50, ConvNeXt-B), Transformers (ViT-B/16, Swin-B), CLIP-based (CLIP zero-shot, GeoCLIP, NumCLIP, YearCLIP), closed-source VLMs (Gemini 1.5/2.0, Grok2, Claude), and open VLMs (Qwen, Gemma, InternVL, LLaVA, MiniCPM, Phi-4-MM). This broad field reveals trends across architectures.
Scoreboard with context:
- YearCLIP leads open CLIP-based baselines: MAE ≈ 39.5 vs. ConvNeXt-B ≈ 44.4 and Swin-B ≈ 47.7. That's like moving from a solid B to a higher B+ on a tough time-guessing quiz.
- Zero-shot CLIP (no fine-tuning) has MAE ≈ 78.2, surprisingly better than some instruction-tuned open VLMs on this task, showing the value of CLIP's pretraining.
- Closed VLMs top the MAE charts: Gemini1.5-Pro ≈ 33.1, Gemini2.0-Flash ≈ 33.9, Grok2 ≈ 35.3, with the best open VLM (Gemma3-27B) ≈ 36.5. That's like an A- to A range in average error, compared to a B range for many vision-only models.
Popularity bias (the surprising headline):
- Many standard vision models (CNNs/Transformers/CLIP-based) actually do worse on the most popular buildings, likely because those landmarks are renovated, mixed-style, or visually tricky. For example, a CLIP-based model's IA±5 can drop by around 8 percentage points between the less and more popular bins.
- In contrast, big VLMs do much better on highly popular buildings: Gemini2.0-Flash's IA±5 jumps from about 24% to 58% (+34 percentage points). That's like getting extra credit on questions you've already seen. It strongly suggests memorization from web-scale pretraining.
Regions and time periods:
- Geography: Errors are lowest in the Americas and Australia, highest in Africa and parts of Europe, with Asia in between. This mirrors dataset balance and pretraining bias. For instance, Gemini2.0-Flash can be ≈ 23.5 MAE in the Americas but ≈ 62.7 in Africa.
- Timeline: Everyone struggles with very old buildings (1000–1600), where MAE can exceed 300 for some models, like trying to date a photo from a tiny, faded clue. Performance is best post-1900, where data is plentiful and styles are more documented.
Density and renovation:
- Semi-urban areas often have the lowest MAE; rural and dense urban areas are harder, perhaps due to variety (rural) or visual clutter (urban).
- Renovation or rebuilds raise error: visible style mismatches the original year, making the ground-truth label hard to infer from current appearance.
Explainability wins:
- YearCLIP's explanations highlight roof types, walls, and symmetry that influenced the year. These rationales matched human expectations in many cases and helped catch oddities (e.g., a modern roof on an older base).
Unexpected findings:
- Some open VLMs trail behind simple CLIP fine-tuning, showing that instruction-following alone doesn't guarantee better time-reasoning.
- Standard vision models perform relatively worse on famous landmarks, the opposite of VLMs, likely because they don't benefit from text-side memorization.
Bottom line: Closed VLMs get top accuracy but reveal strong popularity bias. YearCLIP narrows the error with explainability and ordinal structure, offering a transparent and reproducible baseline.
05 Discussion & Limitations
Limitations:
- Geographic and temporal skew: More images from the Americas and from modern centuries mean models learn those cases better. Underrepresented regions and ancient periods remain challenging.
- Renovation label noise: Ground-truth uses original construction year even when the facade is heavily renovated or rebuilt, which can confuse visually grounded models.
- Popularity proxy: Wikipedia page views are a practical proxy but not a perfect measure of fame in the real world.
Required resources:
- YearGuessr is open and CC BY-SA 4.0; you need standard GPUs to fine-tune YearCLIP (the paper used an RTX 4090).
- For closed VLM evaluations, you need API access.
- Reasoning prompts can be generated once and reused; the training code is standard deep learning fare.
When NOT to use:
- For high-stakes, per-building decisions in underrepresented regions or ancient eras without human review.
- When facades are non-representative (heavy modern cladding on an old core) or photos are interiors/oblique and break the assumption of clear facade cues.
- When you need certainty about the post-renovation "effective year" rather than the original construction year.
Open questions:
- How to better disentangle original vs. current appearance to handle renovations robustly?
- Can active learning or targeted data collection reduce geographic and temporal bias?
- How to reduce popularity bias in VLMs without harming overall accuracy: data curation, regularization, or debiasing losses?
- Could synthetic data (e.g., diffusion-based augmentations) fill early-period gaps without drifting styles?
- What new prompts or visual parsing (e.g., facade element detection) best strengthen reasoning without overfitting?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces YearGuessr, an open, global dataset of 55k building facades with continuous construction years, GPS, captions, and popularity to study how models predict age. It shows that many powerful VLMs are much better on famous landmarks than on ordinary buildings, revealing popularity-driven memorization rather than pure architectural reasoning. It proposes YearCLIP, a CLIP-based, ordinal-regression model with GPS fusion and simple reasoning prompts that improves accuracy and provides human-checkable explanations.
Main achievement: Establishing the first large, open, multimodal, ordinal benchmark for building age estimation and proving, quantitatively, that popularity bias inflates VLM performance on famous sites.
Future directions:
- Expand low-resource regions and ancient periods; refine labels for renovations and rebuilds.
- Explore debiasing strategies, active learning, and diffusion-based augmentation to balance coverage.
- Improve explainability with richer architectural prompts and finer facade parsing.
Why remember this: It's a reality check; top scores don't always mean true understanding. By measuring time as an ordered value and shining light on popularity bias, this work nudges the field beyond memorization toward models that can date ordinary buildings fairly and transparently.
Practical Applications
- Heritage surveys: Quickly estimate ages for large sets of buildings to prioritize preservation.
- Retrofit planning: Identify older structures that may need energy upgrades or safety checks.
- Disaster response: After earthquakes or floods, estimate building ages to assess likely vulnerabilities.
- Urban studies: Map city growth by decades or centuries using consistent, explainable estimates.
- Tourism and education: Create timelines and guided tours that connect styles to time periods.
- Data cleaning: Flag possible Wikipedia or registry inconsistencies when appearance and claimed year disagree.
- Model auditing: Use popularity-aware metrics to detect and reduce memorization bias in VLMs.
- Policy fairness checks: Compare model performance across regions to guide equitable data collection.
- Museum and archive curation: Enrich photo collections with plausible construction-year metadata.
- Architecture teaching aids: Show students how features like roofs and walls relate to historical periods.