Discovering Hidden Gems in Model Repositories
Key Summary
- Millions of public AI models exist, but downloads are concentrated on a tiny set of "official" checkpoints, which are not always the best performers.
- The authors evaluate over 2,000 fine-tuned models and find many "hidden gems" that beat popular models on math, coding, and general benchmarks without higher inference costs.
- In the Llama-3.1-8B family, an unpopular math fine-tune jumps accuracy from 83.2% to 96.0% on GSM8K, a huge gain at the same size.
- Exhaustively testing every model is infeasible, so the paper reframes model selection as a bandit-style search with a fixed query budget.
- They adapt Sequential Halving with two key upgrades: testing all surviving models on the same questions (correlated sampling) and aggressively eliminating weak models early.
- This method reliably finds top-3 models with as few as 50 queries per candidate, over 50× faster than exhaustive evaluation.
- Across four major model trees (Qwen-3B/7B, Mistral-7B, Llama-3.1-8B), their search beats standard baselines and often beats the most popular base models.
- Over 90% of discovered gems had no useful documentation, explaining why popularity-based or text-search heuristics miss them.
- The approach is practical but still needs a small number of queries per model and must be rerun for new tasks.
- Future work could combine this with weight-space learning and smarter query selection to make discovery even faster.
Why This Research Matters
This work shows that better AI performance is already sitting in public repositories; you just need a smart way to find it. Instead of paying more for bigger models, teams can unlock big gains (like +12.8 points on math) at the same size and cost. That helps schools, small startups, and on-device apps get stronger results without new hardware. It also reduces waste: fewer, smarter evaluations avoid burning compute on weak candidates. Finally, it encourages healthier ecosystems where quality, not just popularity, wins, making AI progress more open, fair, and efficient.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine a giant library with millions of books. Most people grab the few shiny bestsellers stacked at the entrance, even though there might be better stories hidden deep on the shelves.
🥬 Filling (The Actual Concept): What was the world like before? Public AI repositories like Hugging Face now host millions of models, including countless fine-tunes of popular "base" models. Because model cards are often incomplete or missing, and because testing everything is too expensive, most users just pick the popular, official checkpoints (like the base or instruct versions). This creates extreme popularity concentration: a microscopic slice of models gets nearly all downloads, while most models sit nearly unseen. How it worked before (step by step):
- Users face many models but have little reliable info.
- They default to the official, well-known checkpoints.
- Leaderboards help only for models that are submitted to them and measured in comparable ways.
- Exhaustive re-benchmarking across many tasks is too costly. Why it mattered: This habit may leave performance on the table, especially for specific tasks like math or coding, because some lesser-known fine-tunes might fit those tasks much better.
🍞 Bottom Bread (Anchor): Think of picking a soccer team from a huge school. If you only try out the kids who are already famous, you might miss a fast new student who never tried out before but could be your best striker.
🍞 Top Bread (Hook): You know how a family tree shows parents, kids, and grandkids? Model "families" look like that too.
🥬 Filling (The Concept: Model Trees): A Model Tree is the set of models that descend from the same base model (the same "ancestor"), like Llama-3.1-8B or Qwen-2.5-7B. How it works:
- Start with a base model (the root).
- People fine-tune it for different purposes (branches and leaves).
- Each fine-tuned version keeps the same base size and cost, but may gain different skills. Why it matters: Comparing models within the same tree keeps inference costs equal, so any performance difference comes from better fine-tuning, not bigger size.
🍞 Bottom Bread (Anchor): If all players on your team must be the same age and height, then the best player is the one with the best training, not the tallest one.
🍞 Top Bread (Hook): Picture a lot of treasure chests in a cave. Some are shiny and easy to spot, but the real gold might be in a dusty box in the corner.
🥬 Filling (The Concept: Hidden Gems): Hidden gems are unpopular fine-tuned models that actually outperform the popular choices on a task. How it works:
- Define "popular" as the top 1% by downloads.
- Define "elite" as the top 1% by measured performance on a task.
- A hidden gem is a model that is not popular, is elite on the task, and beats the best popular model on that task. Why it matters: If hidden gems exist, popularity is not a reliable guide to the best model.
🍞 Bottom Bread (Anchor): In the Llama-3.1-8B family, an under-the-radar math model boosts accuracy from 83.2% to 96.0% on GSM8K, like finding a gold medalist who wasn't even on the school's radar.
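To make the definition above concrete, here is a minimal sketch (not the paper's code) of how one might flag hidden gems, assuming you already have download counts and task scores for every model in a tree; the function name and the 1% cutoffs mirror the definition and are otherwise illustrative.

```python
import numpy as np

def find_hidden_gems(downloads, scores, top_frac=0.01):
    """Flag models that are unpopular yet elite and beat the best popular model.

    downloads, scores: 1-D arrays aligned by model index (hypothetical inputs).
    top_frac: fraction defining "popular" (by downloads) and "elite" (by task score).
    """
    downloads = np.asarray(downloads, dtype=float)
    scores = np.asarray(scores, dtype=float)

    pop_cut = np.quantile(downloads, 1.0 - top_frac)   # top 1% by downloads
    elite_cut = np.quantile(scores, 1.0 - top_frac)    # top 1% by measured performance

    popular = downloads >= pop_cut
    elite = scores >= elite_cut
    best_popular_score = scores[popular].max()

    # Hidden gem: not popular, elite on the task, and strictly better
    # than every popular model on that task.
    return np.where(~popular & elite & (scores > best_popular_score))[0]
```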
🍞 Top Bread (Hook): When you race kids on a track, you want a fair race: everyone runs the same distance.
🥬 Filling (The Concept: RouterBench): RouterBench is a bundle of tasks (like GSM8K for math, MBPP for coding, MMLU, ARC-Challenge, and Winogrande) used to test models fairly. How it works:
- Pick a set of standard questions across different skills.
- Ask each model the same questions.
- Measure accuracy and compare. Why it matters: Without a common test, you can't tell who's truly faster or smarter.
🍞 Bottom Bread (Anchor): It's like giving every runner the exact same stopwatch and the same distance, so you can trust the results.
The Problem and Why Attempts Failed:
- Problem: Are the most popular models actually the best? If not, how can we efficiently find the better ones among thousands or millions?
- Failed attempts: Relying on popularity or text descriptions misses gems because many model cards are incomplete or irrelevant. Leaderboards only help when models are submitted and tested comparably. Exhaustive testing is too expensive.
- The gap: We need a search strategy that finds great models fast, without testing everything.
Real Stakes in Daily Life:
- Students and teachers want better math or coding tutors without needing giant models.
- Small startups need top performance within tight budgets.
- On-device apps (phones, laptops) benefit from the best small models, not just the famous ones.
- Public agencies and nonprofits can deploy stronger models responsibly without extra cost.
02 Core Idea
🍞 Top Bread (Hook): Imagine you have 1,000 mystery snacks but only a tiny number of taste-tests. How do you quickly find the yummiest one without taking a bite of everything?
🥬 Filling (The Concept in One Sentence): Treat model selection like a smart game of elimination: use a bandit-style search to test all candidates a little, throw out obvious losers early, and focus your precious tests on the most promising ones.
How it works (big picture):
- Give each model a small, fair trial.
- Rank them by how well they did.
- Eliminate the worst chunk early (aggressive pruning).
- Re-test the survivors using the exact same questions (shared queries) to compare apples-to-apples.
- Repeat until you're left with a top contender. Why it matters: You find top models fast, with over 50× fewer queries than testing everyone fully, so the "needle in a haystack" becomes findable.
🍞 Bottom Bread (Anchor): It's like a spelling bee: everyone gets the same words each round, half get knocked out early, and the best spellers get more words in later rounds.
Multiple Analogies:
- Bake-off analogy: Give all bakers the same ingredients and same recipe (shared queries), taste a small slice from each, quickly dismiss flat cakes, and spend more time judging the finalists.
- Talent show analogy: Everyone performs the same piece (fairness), early auditions remove weak acts (aggressive elimination), and judges give more time to the top acts (focused budget).
- Treasure hunt analogy: Use a metal detector (the algorithm) to scan quickly, ignore piles of junk metal, and spend time digging only where the beeps are strongest.
Before vs After:
- Before: People mostly chose official checkpoints or read uneven documentation; discovering gems required expensive, exhaustive evaluation.
- After: A principled, fast search reliably finds top-3 models with tiny budgets (as low as 50 queries per model), surfacing gems that beat the popular picks.
Why It Works (intuition without equations):
- Fair comparisons reduce noise: Testing survivors on the same exact questions makes differences real and trustworthy.
- Spend where it matters: Most uploads are weak; removing them early saves queries for close calls among top models.
- Repetition sharpens confidence: Each round adds signal where it counts, on the finalists, so rankings stabilize quickly.
Building Blocks (explained simply):
🍞 Top Bread (Hook): You know how, when shopping with only a few dollars, you try samples to pick the best deal?
🥬 Filling (The Concept: Budget-Constrained Model Discovery): This means hunting for the best model under a fixed limit of total questions you can ask (your "query budget"). How it works:
- Set a total number of queries you can afford.
- Decide how many to spend in each round.
- Use them wisely to compare many models fast. Why it matters: Budgets are real; time and money aren't infinite.
🍞 Bottom Bread (Anchor): Like choosing the best pizza using just 10 free taste coupons: spend them cleverly, not all on the first slice you see.
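As a tiny illustration of the budget arithmetic (the numbers mirror the example used later in the Methodology section; the variable names are illustrative, not from the paper):

```python
import math

n_models = 400          # candidates in one model tree (illustrative)
per_model_budget = 50   # average queries you can afford per model, i.e. N

total_budget = n_models * per_model_budget        # B = 20,000 queries in total
halving_rounds = math.ceil(math.log2(n_models))   # ~9 rounds if you cut half each time

print(f"B = {total_budget} queries, up to {halving_rounds} halving rounds")
```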
🍞 Top Bread (Hook): Imagine slot machines in a row, each with unknown payout. Which one should you play with limited coins?
🥬 Filling (The Concept: Multi-Armed Bandit Problem): It's a framework for learning which option is best when you can test each one only a little. How it works:
- Try each option briefly.
- Use results to guide what to try again.
- Aim to identify the best option (best "arm") under a fixed budget. Why it matters: It formalizes "smart trying" instead of random guessing.
🍞 Bottom Bread (Anchor): Like tasting a few bites from different food trucks to find your favorite before your allowance runs out.
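To ground the framing, here is a minimal best-arm simulation under a fixed budget using the simplest possible strategy: spread queries uniformly across arms, then pick the highest empirical mean. The arm accuracies are made up for illustration; this is a sketch of the bandit setting, not the paper's method.

```python
import random

def pull(arm_accuracy):
    """Simulate one query: 1 if the model answers correctly, else 0."""
    return 1 if random.random() < arm_accuracy else 0

def uniform_best_arm(true_accuracies, total_budget):
    """Spread the budget evenly across arms, then return the best empirical arm."""
    n = len(true_accuracies)
    pulls_each = max(1, total_budget // n)
    means = []
    for acc in true_accuracies:
        wins = sum(pull(acc) for _ in range(pulls_each))
        means.append(wins / pulls_each)
    return max(range(n), key=lambda i: means[i])

# Illustrative arms: most are mediocre, one is a hidden gem (index 3).
arms = [0.55, 0.60, 0.58, 0.73, 0.61]
print(uniform_best_arm(arms, total_budget=250))  # usually prints 3, but noisy at small budgets
```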
🍞 Top Bread (Hook): Think of a tournament where each round cuts the player pool in half.
🥬 Filling (The Concept: Sequential Halving): A method that tests everyone a little, ranks them, removes the worst half, and repeats with more tests on the survivors. How it works:
- Round 1: tiny test for all.
- Eliminate bottom half.
- Round 2: bigger test for those left.
- Repeat until a winner remains. Why it matters: It zooms in on the best options quickly.
🍞 Bottom Bread (Anchor): Like a video game bracket: early levels are quick, the final boss gets your full attention.
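A compact sketch of vanilla Sequential Halving (not the paper's implementation); `evaluate` stands in for querying one model on a fresh batch of questions and is an assumed helper.

```python
import math
import random

def sequential_halving(candidates, total_budget, evaluate):
    """Vanilla Sequential Halving: equal budget per round, drop the worst half each round.

    candidates: list of model identifiers.
    total_budget: total number of queries allowed across all rounds.
    evaluate(model, n_queries) -> mean accuracy on n_queries fresh questions (assumed helper).
    """
    survivors = list(candidates)
    n_rounds = max(1, math.ceil(math.log2(len(candidates))))
    per_round_budget = total_budget // n_rounds

    while len(survivors) > 1:
        queries_each = max(1, per_round_budget // len(survivors))
        scores = {m: evaluate(m, queries_each) for m in survivors}
        survivors.sort(key=scores.get, reverse=True)           # rank by this round's score
        survivors = survivors[: max(1, len(survivors) // 2)]   # keep the top half
    return survivors[0]

# Toy usage: each "model" is just a latent accuracy we sample answers from.
models = {f"finetune-{i}": random.uniform(0.40, 0.75) for i in range(64)}
noisy_eval = lambda m, n: sum(random.random() < models[m] for _ in range(n)) / n
print(sequential_halving(list(models), total_budget=3200, evaluate=noisy_eval))
```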
03 Methodology
At a high level: Input (a model tree and a fixed number of queries) → Round 1: quick tests for all → Rank and aggressively eliminate many → Round 2+: re-test survivors on the exact same questions (shared queries) with more budget → Final: pick the top performer.
Step-by-step recipe with purpose and examples:
- Define the playground and the rules
- What happens: Choose one model tree (e.g., Llama-3.1-8B) so all candidates have the same size and inference cost. Set a total query budget B you can afford.
- Why this step exists: Keeping models similar in size makes performance differences meaningful (skill, not hardware). A fixed budget prevents runaway costs.
- Example: Suppose there are 400 candidates and you can afford about N=50 queries per model on average; your total budget is B ≈ 20,000 queries.
- Choose fair questions (shared query sets)
- What happens: Build a query set drawn from tasks in RouterBench (e.g., GSM8K for math, MBPP for coding, MMLU, ARC-Challenge, Winogrande). Subsample to a manageable size and hold that set fixed for all survivors in a round.
- Why this step exists: If Model A gets easy questions and Model B gets hard ones, comparisons become noisy. Shared queries turn comparisons into apples-to-apples.
- Example: Pick 2,500 total questions spread across tasks. In Round 1, all models see the same small slice; in Round 2, survivors see a larger, but still identical, slice.
🍞 Top Bread (Hook): You know how in science class, you test two plants by giving them the same sunlight and water so your results are fair?
🥬 Filling (The Concept: Correlated Sampling): Make models answer the exact same questions within a round so differences in accuracy truly come from the models, not luck of the draw. How it works:
- In Round s, sample a batch of questions.
- Give that same batch to every surviving model.
- Compare results directly. Why it matters: It reduces variance (random wobble) in comparisons, so you don't eliminate a good model by accident.
🍞 Bottom Bread (Anchor): It's like timing all runners on the same track in the same weather; now you trust the winner is really faster.
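A minimal sketch of correlated sampling within one round (a hypothetical helper, not the paper's code): one batch of questions is drawn, and every surviving model answers exactly that batch. `answer_question` is an assumed function that runs inference and returns 1 for a correct answer, 0 otherwise.

```python
import random

def round_scores(survivors, question_pool, n_questions, answer_question, seed=0):
    """Score every surviving model on the SAME sampled batch (correlated sampling)."""
    rng = random.Random(seed)
    batch = rng.sample(question_pool, n_questions)  # one shared batch for this round

    scores = {}
    for model in survivors:
        # answer_question(model, question) -> 1 if correct, 0 otherwise (assumed helper)
        correct = sum(answer_question(model, q) for q in batch)
        scores[model] = correct / len(batch)
    # Differences now reflect the models, not which questions they happened to draw.
    return scores
```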
- Normalize evaluation choices per task
- What happens: Some models have a recommended system prompt. Evaluate each model with and without it per task and keep whichever scored better for that task.
- Why this step exists: Small formatting or prompting differences can unfairly help or hurt; this normalizes such effects.
- Example: For coding (MBPP), Model X might do better with its built-in system prompt, while for math (GSM8K) it does better without; keep the better result per task.
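A small sketch of this per-task normalization (illustrative; `evaluate_on_task` and `recommended_system_prompt` are assumed helpers, not the paper's API):

```python
def normalized_task_score(model, task, evaluate_on_task, recommended_system_prompt):
    """Keep the better of (with recommended system prompt, without) for this model on this task."""
    score_without = evaluate_on_task(model, task, system_prompt=None)

    sys_prompt = recommended_system_prompt(model)  # may be None if the model card lists none
    if sys_prompt is None:
        return score_without

    score_with = evaluate_on_task(model, task, system_prompt=sys_prompt)
    return max(score_with, score_without)
```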
- Round 1: Spend early to avoid mistakes, then prune hard
- What happens: Allocate a substantial chunk of the per-model budget to Round 1 to avoid prematurely dropping strong models due to noise. Then eliminate a large fraction at once to save budget (aggressive elimination to, say, the top ~100 models).
- Why this step exists: Repositories are skewed; most uploads are weak or broken. A stronger first look plus big early cuts preserves contenders while freeing budget.
- Example: With N=10 queries per model, spend ~6 in Round 1 to get a decent read, then keep only ~100 of 400 models for Round 2.
🍞 Top Bread (Hook): Cleaning your room? First, toss the obvious trash so you have space to organize the good stuff.
🥬 Filling (The Concept: Aggressive Elimination Schedule): After an informative first round, quickly remove a large group of low performers and focus resources on the promising few. How it works:
- Use a bigger initial test to reduce early mistakes.
- Cut down to a fixed small pool (e.g., 100 models) fast.
- Reallocate the saved budget to later rounds for finer comparisons. Why it matters: It prevents wasting queries on obviously weak models while ensuring strong ones survive.
🍞 Bottom Bread (Anchor): Like tryouts where you spend a bit more time on everyone on day one, then invite only the top 100 back for day two.
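Here is a sketch of such a budget schedule: spend a larger slice of the per-model budget in Round 1, cut straight to a small fixed pool (e.g., 100 models), then halve as usual with the leftover budget. The numbers mirror the example above (400 candidates, N=10, keep ~100) and the 60% Round-1 share is an illustrative assumption, not the paper's exact setting.

```python
import math

def aggressive_schedule(n_models, per_model_budget, round1_fraction=0.6, keep_after_round1=100):
    """Plan (pool size, shared queries per surviving model) for each round.

    Round 1 gets a large share of the budget to avoid noisy eliminations, then the pool
    is cut straight to `keep_after_round1` models and halves in later rounds.
    """
    total_budget = n_models * per_model_budget
    round1_each = max(1, int(per_model_budget * round1_fraction))  # e.g., 6 of 10 queries
    plan = [(n_models, round1_each)]

    remaining = total_budget - n_models * round1_each
    pool = min(keep_after_round1, n_models)
    n_later_rounds = max(1, math.ceil(math.log2(pool)))
    per_round = remaining // n_later_rounds

    while pool > 1:
        plan.append((pool, max(1, per_round // pool)))
        pool = max(1, pool // 2)
    return plan

# Mirrors the example above: 400 candidates, N=10, keep ~100 after Round 1.
for pool, queries_each in aggressive_schedule(400, per_model_budget=10):
    print(f"{pool:4d} models x {queries_each} shared queries each")
```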
- Later rounds: Increase depth, keep comparisons fair
- What happens: With fewer survivors, give more queries per model and keep using shared query sets per round. Rank again, eliminate the lower half or use your scheduleās cuts, and repeat until finalists remain.
- Why this step exists: More queries per promising model sharpen your confidence in small performance gaps among top contenders.
- Example: By Round 3, each remaining model may have answered several times more questions than in Round 1, clarifying close races (e.g., 0.729 vs 0.720 accuracy).
- Scoring and selection
- What happens: Within each task, compute accuracy (correct = 1, else 0). Aggregate results (e.g., RouterBench score). Pick the highest-scoring model.
- Why this step exists: You want one clear winner by the end of your budget.
- Example: In the Llama-3.1-8B tree with N=50, the method often returns a top-3 overall model with accuracy ≈ 0.736, beating both the base (≈ 0.713) and standard baselines (~0.720).
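A minimal scoring sketch: per-task accuracy, then an unweighted mean across tasks as an overall RouterBench-style score. The paper's exact aggregation may weight tasks differently, so treat this as an assumption.

```python
from collections import defaultdict

def aggregate_score(results):
    """results: list of (task_name, is_correct) pairs for one model.

    Returns (per-task accuracy dict, unweighted mean across tasks).
    """
    per_task = defaultdict(list)
    for task, is_correct in results:
        per_task[task].append(1.0 if is_correct else 0.0)

    task_acc = {task: sum(vals) / len(vals) for task, vals in per_task.items()}
    overall = sum(task_acc.values()) / len(task_acc)
    return task_acc, overall

# Toy usage with made-up answers:
acc, overall = aggregate_score([("gsm8k", True), ("gsm8k", False), ("mbpp", True)])
print(acc, round(overall, 3))  # {'gsm8k': 0.5, 'mbpp': 1.0} 0.75
```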
- Secret sauce: Two design choices
- Correlated sampling (shared queries) slashes comparison noise and avoids unfair eliminations.
- Aggressive early elimination respects the skewed quality distribution and saves budget for the real contenders.
Putting it together on real data:
- Inputs: Four trees (Qwen-3B, Qwen-7B, Mistral-7B, Llama-3.1-8B); subsampled RouterBench; budgets like N=10 and N=50.
- Process: Round 1 uses a meaningful chunk of queries per model, then prunes to ~100. Later rounds deepen tests among survivors with shared queries. Per task, keep the better of with/without system prompt.
- Output: A top model (often top-3 in the entire tree) found with as few as 50 queries per candidate.
04 Experiments & Results
The Test: What was measured and why
- Goal: Given a fixed budget, retrieve the top-performing model in a model tree.
- Datasets: Subsampled RouterBench (mix of ARC-C, Winogrande, MMLU, MBPP, GSM8K) to 2,500 total queries for practicality.
- Metrics: Mean rank of the retrieved model (lower is better) and top-1 accuracy (higher is better) across 100 trials.
- Budgets: Very low (N=10) and mid (N=50) queries per model, plus extended tests at N=25/100/200.
The Competition: Baselines compared
- Random Selection; Best Base (the popular checkpoint); Uniform; UCB variants; TTTS; Successive Rejects; Bayesian Elimination; and standard Sequential Halving.
Scoreboard with context
- Finding gems at low budgets is hard: With N=10, many baselines often fail to beat the popular base models in Qwen and Llama trees.
- The proposed method shines: Even at N=10, it retrieves substantially better models than Best Base across trees (e.g., in Llama-8B, ≈0.725 vs base ≈0.713).
- At N=50, the method typically returns a top-3 model: For Llama-8B, ≈0.736 accuracy and rank ≈3.0, beating standard Sequential Halving (~0.720, rank ~29.9) and Best Base (~0.713). That's like jumping from a solid B to an A-.
- Hidden gems are real and big: In Llama-3.1-8B, a math fine-tune reaches 96.0% on GSM8K versus 83.2% for the popular baseline (+12.8 percentage points). Qwen-3B math climbs from ~83.5% to 89.0%. Mistral-7B shows dramatic gains across math, coding, and overall (e.g., +14 percentage points on RouterBench compared to the popular base).
- Average uplift: Across tasks and trees, the method improves average performance by over 4.5% while using >50× fewer queries than exhaustive evaluation, which is the difference between "good enough" and "leaderboard-level" in many practical settings.
Surprising findings
- Documentation gaps: Over 90% of gems lacked relevant performance documentation; some listed unrelated metrics (e.g., multilingual scores for a math gem). This explains why popularity and text search miss them.
- No easy shortcuts: High performers weren't neatly clustered near the tree root or along obvious branches. Simple heuristics based on downloads or graph centrality fail.
- Budget matters, but smart spending matters more: Even when baselines improved with more queries (e.g., N=100/200), the proposed approach often matched or beat them using half the budget, showing the value of early aggressive pruning plus shared queries.
Takeaway: With shared queries (correlated sampling) and an aggressive elimination schedule built on Sequential Halving, you can reliably surface top-3 models in large trees under tiny budgets, revealing gems that popularity metrics overlook.
05 Discussion & Limitations
Limitations (honest and specific)
- Still needs some queries per model: Although >50× cheaper than exhaustive evaluation, the approach must test every candidate at least a little. If you truly cannot run any queries, you'd need weight-space learning or metadata, but those are not yet reliable at large LLM scale.
- Task coverage: Results show gems in math, coding, reasoning, and general tasks, but new tasks (e.g., tool use, safety, multilingual dialog) require re-running the search with appropriate queries.
- Data selection sensitivity: Using a small query subset risks overfitting to those samples. Shared queries reduce noise, but careful curation still matters.
- Operational hiccups: Some repository models fail to load or run (version mismatches, missing tensors). Practical pipelines need robust fallbacks and logging.
- Within-tree assumption: Comparisons are most fair within a model tree (same size/cost). Cross-tree searches may conflate performance with model scale or architecture.
Required resources
- Modest compute for inference-only evaluation at small budgets; an orchestration script to load models, ask questions, and score answers; and storage for caching outputs.
- A standardized harness (greedy decoding, consistent max lengths, per-task prompt handling) to ensure fair, repeatable comparisons.
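As one concrete possibility, here is a minimal harness sketch using Hugging Face transformers with greedy decoding and a fixed generation length; the model name, prompt handling, and lengths are placeholders, and the paper's actual harness may differ.

```python
# A minimal evaluation-harness sketch (assumed setup; not the paper's actual code).
# Requires: pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_candidate(model_name):
    """Load one candidate model and its tokenizer once, then reuse them for all queries."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    return tokenizer, model

def greedy_answer(tokenizer, model, prompt, max_new_tokens=256):
    """Answer one question with greedy decoding so comparisons are repeatable."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        do_sample=False,                # greedy decoding: no sampling randomness
        max_new_tokens=max_new_tokens,  # consistent generation length across models
    )
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```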
When NOT to use
- If you already have a large, high-quality, task-specific evaluation set and can afford to test a handful of carefully chosen candidates fully, simple exhaustive testing on that shortlist may suffice.
- If your constraints are multi-objective (strict latency, memory, safety filters, cost), you may need a more complex search (multi-objective bandits) rather than pure accuracy.
- If models vary widely in size and you cannot normalize costs, results may be skewed by capacity rather than fine-tuning quality.
Open questions
- Smarter query selection: How to pick the smallest, most revealing question sets per task without bias?
- Weight-space learning: Can we pre-rank candidates from their weights, then use tiny query budgets to confirm?
- Cold-start and streaming: How to continuously incorporate newly uploaded models without restarting the whole search?
- Multi-objective discovery: Jointly optimize accuracy, latency, cost, and safety.
- Robustness and generalization: Ensure found gems stay strong on fresh, out-of-sample data and real users.
06 Conclusion & Future Work
Three-sentence summary: Public model hubs hide many high-performing fine-tunes that beat the popular base models on key tasks. This paper reframes model selection as a budgeted bandit search and upgrades Sequential Halving with shared queries and aggressive early elimination. The result is a practical algorithm that reliably finds top-3 models with as few as 50 queries per candidate, delivering >50× speedups over exhaustive evaluation.
Main achievement: Proving that hidden gems are widespread, and providing a fast, principled, and reproducible method to surface them under tight budgets.
Future directions: Combine this search with weight-space pre-ranking to shrink budgets even further; design tiny-yet-informative query sets; extend to multi-objective constraints (accuracy + latency + cost + safety); and support continuous, streaming discovery as repositories evolve.
Why remember this: Popularity isn't performance. With a small, smart search, you can unlock big accuracy gains, like 96% math accuracy at the same 8B size, making better AI accessible to everyone without paying more per inference.
Practical Applications
- Auto-select the best small model for a task (e.g., math tutor, code helper) without changing hardware costs.
- Continuously scan new uploads in a model family and alert when a gem appears for your benchmarks.
- Build a CI pipeline that, on each release, runs budgeted discovery to pick the deployment model.
- For edge/phone apps, use discovery to choose the strongest 7B/8B model that fits on-device constraints.
- Cloud routing: periodically re-rank experts in a multi-model router using budgeted evaluation to improve end-to-end quality.
- Enterprise MLOps: maintain a shortlist of top models per task that is auto-refreshed weekly with small query budgets.
- Hackathon/education: let students run a tiny-budget search to find the best free model for their project in an hour.
- Safety/QA teams: include safety prompts in the shared query set to find models that balance accuracy with policy adherence.
- Research triage: pre-filter thousands of candidates to a top-20 shortlist for deeper, expensive evaluations.
- Localization: discover the best same-size models for a specific language or domain (e.g., legal, medical) with minimal queries.