
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Intermediate
Xiangyi Li, Wenbo Chen, Yimin Liu et al. Ā· 2/13/2026
arXiv

Key Summary

  • SkillsBench is a big test playground that measures whether giving AI agents step-by-step 'Skills' actually helps them finish real tasks.
  • It includes 84 tough, real-world tasks in 11 areas like healthcare, energy, finance, and software, all with strict, automatic graders.
  • Each task is run three ways: without Skills, with expert-picked (curated) Skills, and with Skills that the model tries to write by itself.
  • Curated Skills boost success by an average of +16.2 percentage points across seven agent–model setups, but the size of the boost depends on the domain and the harness.
  • Self-generated Skills don’t help on average (–1.3 points), showing models are not yet good at authoring the very procedures they benefit from following.
  • Short, focused Skills (2–3 modules) beat long, encyclopedic docs; too many or too-long Skills cause confusion and lower scores.
  • In some fields like healthcare and manufacturing, Skills make a huge difference (+52 and +42 points), while in math and software the gains are much smaller.
  • Smaller models plus good Skills can match or beat bigger models without Skills, meaning Skills can partly replace pure model size.
  • Everything is checked with deterministic, code-based verifiers (no 'LLM-as-judge'), so pass/fail is consistent and reproducible.
  • SkillsBench gives teams a fair way to decide which Skills to write, which ones to keep short, and when Skills won’t help enough to be worth it.

Why This Research Matters

In real workplaces, many mistakes come from missing a small but important step—like forgetting a unit conversion or a final quality check. SkillsBench shows that giving AI agents short, clear 'how-to' Skills can reduce those mistakes and help them finish real tasks more often. It also proves you don’t always need the biggest, most expensive model if you have good Skills, which can save money. Teams now get data-backed rules of thumb: keep Skills focused, use 2–3 at a time, and avoid giant docs. Because everything is graded by code, results are trustworthy and repeatable. This makes AI assistants more dependable in areas that really matter, from healthcare data cleaning to financial analysis and manufacturing planning.

Detailed Explanation


01Background & Problem Definition

Concepts first (explained with the Sandwich pattern)

šŸž Hook: You know how having a recipe helps you cook a new dish faster than guessing ingredients? 🄬 The Concept (Agent Skills): Agent Skills are recipe-like packets that tell an AI agent exactly how to do a kind of task (procedures, checklists, examples, and optional scripts). How it works: 1) Package step-by-step instructions and helpful resources, 2) Give them to the agent at run time, 3) The agent follows the steps to solve tasks more reliably. Why it matters: Without Skills, agents rely on fuzzy memory and may miss critical steps. šŸž Anchor: A spreadsheet-cleaning Skill might say 'standardize column names, check date formats, then verify totals,' turning a messy sheet into a clean one.

šŸž Hook: Imagine a science fair where every experiment has the same rules and a clear scorecard. 🄬 The Concept (SKILLSBENCH): SKILLSBENCH is a standardized set of tasks and strict graders to test if Skills really help AI agents. How it works: 1) Put each task in a clean container, 2) Run it three ways (no Skills, curated Skills, self-made Skills), 3) Use a programmatic verifier to mark pass/fail, 4) Compare results. Why it matters: Without a fair scoreboard, we can’t tell if Skills are helpful or just extra words. šŸž Anchor: It’s like timing runners on the same track with the same stopwatch.

šŸž Hook: Think of a librarian who hand-picks the best books for a kid learning science. 🄬 The Concept (Curated Skills): Curated Skills are expert-selected, well-written procedures for a whole class of problems. How it works: 1) Experts write clear steps and include examples or scripts, 2) They ensure the steps apply to many similar tasks, 3) Agents load and follow them during the task. Why it matters: Without curated quality, agents get lost in vague tips. šŸž Anchor: A 'clinical unit conversion' Skill that explains how to convert mg/dL to mmol/L correctly across lab systems.

šŸž Hook: Imagine a student trying to write their own textbook right before the exam. 🄬 The Concept (Self-Generated Skills): Self-generated Skills are procedures the model writes for itself on the spot. How it works: 1) The agent drafts 1–5 mini-Skills from the task description, 2) Saves them, 3) Tries to use them. Why it matters: If models can’t write solid procedures, this won’t help much—and it might distract them. šŸž Anchor: The model says ā€œuse pandasā€ but forgets the exact pivot-table recipe, so the solution fails the checker.

šŸž Hook: Report cards don’t just say 'good' or 'bad'—they give grades so you can compare. 🄬 The Concept (Performance Metrics): These are the ways we score agents, like pass rate (how many tasks they solve) and normalized gain (how much closer to perfect they got with Skills). How it works: 1) Run trials, 2) Compute pass/fail per task, 3) Average across tasks and setups. Why it matters: Without numbers, we can’t see progress or trade-offs. šŸž Anchor: '48.7% pass rate with Skills' is like getting a strong B where others are at a C.

šŸž Hook: If two kids both improve, it helps to know who improved more compared to how far they had left to go. 🄬 The Concept (Normalized Gain): Normalized gain measures improvement toward 100% relative to your starting point. How it works: (with Skills āˆ’ without Skills) Ć· (100 āˆ’ without Skills). Why it matters: A 5-point jump from 90 to 95 is not the same as 5 points from 10 to 15. šŸž Anchor: Going from 30% to 45% might be a big relative leap, even if the final number isn’t the highest.

šŸž Hook: Coaches replay a game to see what moves helped or hurt. 🄬 The Concept (Trajectory Analysis): This looks at the agent’s step-by-step path through the task—what it tried, what failed, and why. How it works: 1) Log each action, 2) Match failures to categories (timeout, wrong format, wrong math), 3) Spot patterns. Why it matters: Without this, you can’t fix repeated mistakes. šŸž Anchor: Seeing that agents often build the right file but use the wrong formula tells you to add a formula checklist to the Skill.

The World Before: Large language models (LLMs) got great at writing and reasoning in general, but many real jobs depend on precise procedures—like how to structure a CSV, how to safely edit a config, or the exact steps to build a pivot table. Fine-tuning a model for every domain is expensive and narrows its general ability. So teams began handing LLM agents extra 'how-to' notes at run time—Agent Skills.

The Problem: Everyone started making Skills, but there was no fair, standard way to tell if Skills actually helped. Did they reduce mistakes? Did shorter Skills beat longer ones? Could models write their own Skills? Without a shared benchmark, teams guessed.

Failed Attempts: Prior agent benchmarks mostly tested a model’s raw ability alone (no Skills), or they used fuzzy, opinion-based graders. Others mixed factual retrieval (like looking up facts) with procedural help (how to do a thing), which blurred results. None treated Skills themselves as first-class things to measure.

The Gap: We needed a controlled, apples-to-apples test bed that: (1) runs the same tasks with and without Skills, (2) uses strict, code-based verifiers for pass/fail, (3) logs every step, and (4) compares curated Skills to self-generated ones.

Real Stakes: In everyday life, this decides whether your AI coding helper fixes the right bug, whether a spreadsheet Skill prevents a billing error, and whether a healthcare data Skill changes units safely. It also controls cost: maybe a smaller, cheaper model plus good Skills beats a giant one without them. That’s money saved and fewer mistakes at work.

02Core Idea

The 'Aha!' in one sentence: Treat Skills as first-class, testable 'recipes' and measure their real value by running the exact same tasks with and without them under strict, automatic graders.

Three analogies for the same idea:

  1. Cooking: You test whether a recipe helps by timing cooks with and without that recipe making the same dish, then tasting with the same judging rules.
  2. Sports: You measure the impact of a new playbook by having teams run the same drills with and without the playbook, scored by the same referee.
  3. Lego: You see if building guides help by giving kids the same bricks and asking them to build the same model, then checking with a checklist.

Before vs After:

  • Before: Skills were added ad hoc; no one knew if longer or shorter Skills were better, or if models could write their own. Success often depended on which agent harness you used and how you injected context, but comparisons were muddy.
  • After: With SKILLSBENCH, you get matched comparisons (no Skills vs curated vs self-generated) across 84 realistic tasks, strict pass/fail verifiers, and full logs. Now you can say things like 'curated Skills add +16.2 points on average, especially in healthcare and manufacturing,' or '2–3 focused Skills work best,' or 'self-generated Skills don’t help on average.'

Why it works (intuition, not equations):

  • Paired testing removes confounders: Run the same task, in the same container, with the same timeouts, changing only the presence or type of Skills. That isolates the effect of procedural guidance.
  • Deterministic verifiers remove opinion: Programmatic tests (like unit tests) make pass/fail crystal clear. No 'LLM-as-judge' means less noise.
  • Stratifying across agents and domains reveals patterns: Some harnesses use Skills more reliably; some domains need procedures more. You spot when Skills are truly the missing piece.

Building blocks (the idea broken down):

  • Skills definition: Must be procedural (how-to), reusable for a class of tasks, structured (SKILL.md + optional scripts/examples), and portable across harnesses.
  • Task design: Realistic instructions, Dockerized environment, a reference solution (oracle), and deterministic verifier tests.
  • Three conditions: (1) No Skills, (2) Curated Skills, (3) Self-generated Skills created by the model before trying the task.
  • Metrics: Pass rate and normalized gain; both are reported because absolute jumps and proportional improvement tell different stories.
  • Analysis layers: Model–harness combos, domain-level differences (e.g., healthcare vs software), task-level winners/losers, and design factors (how many Skills, how long, how detailed).
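The "structured and portable" requirement can be pictured with a minimal loader sketch. The folder layout, frontmatter-free SKILL.md format, and `load_skills` helper here are illustrative assumptions; real harnesses differ in how they discover Skills:

```python
import tempfile
from pathlib import Path


def load_skills(skills_dir: Path) -> dict:
    """Collect SKILL.md contents from a filesystem skills directory.

    A real harness would also surface any scripts or examples alongside;
    this sketch only gathers the procedural text, keyed by Skill name.
    """
    skills = {}
    for skill_md in sorted(skills_dir.glob("*/SKILL.md")):
        skills[skill_md.parent.name] = skill_md.read_text(encoding="utf-8")
    return skills


# Build a throwaway skills directory with one illustrative Skill.
root = Path(tempfile.mkdtemp())
skill = root / "pivot-tables"
skill.mkdir()
(skill / "SKILL.md").write_text(
    "# Pivot tables\n1. Load the CSV.\n2. Group by region.\n3. Verify totals.\n",
    encoding="utf-8",
)

loaded = load_skills(root)
print(list(loaded))  # ['pivot-tables']
```

Because Skills live on the filesystem rather than inside a prompt, any harness that knows the convention can pick them up, which is what makes them portable.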

More Sandwich concept mini-explanations introduced here:

šŸž Hook: When you install a game console, the operating system decides how games run and where saves go. 🄬 The Concept (Agent Harness): The agent harness is the 'operating system' that loads Skills, manages tools, and runs the model through the task. How it works: 1) Finds Skills in a known folder, 2) Feeds Skills and instructions to the model, 3) Handles tool calls and logs. Why it matters: If the harness doesn’t surface Skills well, good Skills may be ignored. šŸž Anchor: Claude Code often loads and uses Skills more reliably than some other harnesses.

šŸž Hook: A teacher’s answer key makes grading fair every time. 🄬 The Concept (Deterministic Verifier): A deterministic verifier is a set of automatic, code-based checks that say pass or fail with no guessing. How it works: 1) Run tests, 2) Each test asserts facts (files exist, numbers match within tolerance), 3) All must pass. Why it matters: Without deterministic checks, scores vary with mood or wording. šŸž Anchor: 'pytest' checks that your CSV has the right headers and sums.

šŸž Hook: Counting how many questions you got right on a quiz tells you how close you are to 100%. 🄬 The Concept (Pass Rate): Pass rate is the percentage of tasks that the agent fully solves under the rules. How it works: 1) Try each task several times, 2) Average pass/fail, 3) Average across tasks. Why it matters: It’s the clearest scoreboard for 'Did the agent actually do the job?' šŸž Anchor: '48.7% with Skills' means almost half the tasks were solved end-to-end.

Putting it all together: The key insight is simple but powerful—Skills are only useful if they help agents pass strict tests on real tasks. By turning Skills into measurable, portable 'apps' you can combine with different 'operating systems' (harnesses) and 'CPUs' (models), you can finally see which Skills, how many, and how long they should be to make the most difference.

03Methodology

High-level recipe: Input → [Set up controlled task box] → [Run three conditions] → [Strict auto-grader] → Output scores and logs

  1. Build fair, self-contained tasks
  • What happens: Each task is packaged in a Docker container with: human-written instructions, task data, optional Skills folder, a reference 'oracle' solution, and a deterministic pytest-based verifier. Everything runs inside this clean box.
  • Why this step exists: Keeps runs reproducible and prevents 'leaking' shortcuts. Without it, differences in machines or hidden files could change results.
  • Example: An 'Excel pivot analysis' task includes the data files, a clear instruction.md, an oracle script that really makes the pivot, and a test_outputs.py that confirms the pivot totals.
  2. Define what counts as a Skill
  • What happens: Skills must be procedural (how-to), reusable for a class of tasks, structured (SKILL.md plus optional scripts/templates/examples), and portable (filesystem-based so any harness can load them). They are not just prompts, random examples, or tool manuals.
  • Why this step exists: Keeps 'Skills' from turning into answer keys or noisy context. Without a clear definition, comparisons break.
  • Example: A 'PDF-to-Excel diff' Skill gives steps: extract tables with tool X, align columns, compare hashes, and verify row counts; it may include a helper script.
  3. Run three controlled conditions
  • What happens: For each task, the agent tries it three ways: • No Skills: Only see the instruction. • With Skills: Load the curated environment/skills directory. • Self-Generated Skills: The agent is told to first write 1–5 mini-Skills it thinks it needs, save them, then solve the task.
  • Why this step exists: This isolates the effect of Skills and tests whether models can author helpful procedures themselves.
  • Example: On 'sales-pivot-analysis,' the agent either wings it, uses curated pivot-table procedures, or writes its own how-to first.
  4. Keep the grader strict and repeatable
  • What happens: The pytest verifier runs assertions with tolerances (for numbers) and marks pass/fail. No LLM judges. Each task is tried multiple times; results are averaged.
  • Why this step exists: Removes subjectivity and reduces noise. Without it, tiny wording changes could flip a score.
  • Example: The verifier checks that the output file exists, column names match, totals match within ±0.1%, and no extra columns were added.
  5. Evaluate across multiple agent–model stacks
  • What happens: Seven configurations are tested (e.g., Claude Code + Opus 4.5, Gemini CLI + Gemini 3 Flash, Codex + GPT-5.2). All configurations sample at temperature 0 for determinism.
  • Why this step exists: Harnesses differ in how they surface and use Skills; models differ in reasoning. Without breadth, you can’t generalize.
  • Example: Gemini 3 Flash with Skills reaches the highest pass rate (48.7%); Claude Code with Opus 4.5 shows the biggest jump (+23.3 points).
  6. Measure with two complementary metrics
  • What happens: Compute pass rate (absolute success) and normalized gain (improvement toward 100% relative to baseline). Report both.
  • Why this step exists: Absolute jumps can be small at the top due to ceiling effects; normalized gain reveals proportional lift.
  • Example: Moving from 30% to 45% is a big relative jump, while 90% to 95% is small relatively even if both are +5 points.
  7. Analyze by domain, task, and Skill design
  • What happens: Break down results across 11 domains, identify tasks with big positive or negative deltas, and study how the number and length of Skills affect outcomes.
  • Why this step exists: Not all areas need the same help; sometimes too many instructions cause overload.
  • Example: 2–3 Skills are optimal; 4+ Skills drop improvements to +5.9 points. Detailed/compact Skills beat comprehensive ones.
  8. Log and categorize failures (Trajectory Analysis)
  • What happens: Every attempt logs actions and is matched with verifier results to classify failures: timeouts, spec violations, domain knowledge gaps, incorrect implementations, quality-below-threshold, and more.
  • Why this step exists: To fix Skills, you must know how agents fail. Without a taxonomy, patterns stay hidden.
  • Example: The dominant failure is 'quality below threshold'—outputs exist but numbers don’t meet tolerances—so Skills should emphasize checklists and validations.
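The paired design above can be sketched as a small evaluation loop. Everything here is hypothetical scaffolding (`run_task` is a stand-in for launching the containerized agent and its verifier, with toy outcomes so the loop runs); it only shows how the three conditions and per-condition pass rates line up:

```python
from statistics import mean

CONDITIONS = ["no_skills", "curated_skills", "self_generated_skills"]


def run_task(task_id: str, condition: str, trial: int) -> bool:
    """Stand-in for a real containerized agent run plus deterministic verifier.

    Outcomes are faked so the loop is runnable; a real harness would launch
    Docker, inject the chosen Skills, and run the pytest verifier.
    """
    if condition == "curated_skills":
        return trial % 2 == 0  # toy rule: curated Skills pass even trials
    if condition == "no_skills":
        return trial % 3 == 0  # toy rule: baseline passes every third trial
    return False               # toy rule: self-generated Skills fail here


def evaluate(tasks, trials=3):
    """Pass rate per condition: same tasks, same trials, only Skills differ."""
    scores = {}
    for cond in CONDITIONS:
        results = [run_task(t, cond, k) for t in tasks for k in range(trials)]
        scores[cond] = 100.0 * mean(results)
    return scores


scores = evaluate(["sales-pivot-analysis", "flood-risk-analysis"])
uplift = scores["curated_skills"] - scores["no_skills"]
print(scores, f"curated uplift: {uplift:+.1f} points")
```

Because the only variable flipped between runs is the Skills condition, any difference in pass rate is attributable to the Skills rather than to the machine, the data, or the timeout.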

The secret sauce:

  • Paired, deterministic evaluation of Skills as first-class artifacts. By holding everything else constant and only flipping which Skills are present, SKILLSBENCH turns a fuzzy idea—'do Skills help?'—into clear, reproducible numbers.
  • Skill design experiments. Testing how many Skills to include and how detailed they should be reveals practical authoring rules (short and focused wins).
  • Cross-harness reality check. Because different harnesses load and apply Skills differently, the benchmark’s multi-harness design shows when great Skills still won’t land (and why).

04Experiments & Results

  1. The test: What was measured and why?
  • Measure: End-to-end pass rate on 84 realistic tasks across 11 domains, scored by deterministic verifiers. Also compute normalized gain to capture proportional improvement.
  • Why: Pass rate answers the simple question 'Did the agent actually finish the job correctly?' Normalized gain helps compare improvements that start from different baselines.
  2. The competition: Who/what was compared?
  • Three conditions per task: (1) No Skills, (2) Curated Skills, (3) Self-generated Skills (where supported).
  • Seven agent–model configurations: Claude Code (Opus 4.5/4.6, Sonnet 4.5, Haiku 4.5), Gemini CLI (Gemini 3 Pro, Gemini 3 Flash), Codex (GPT-5.2).
  • Total: 7,308 valid trajectories, multiple runs per task for stability.
  3. The scoreboard (with context):
  • Curated Skills help a lot on average: +16.2 percentage points uplift across configurations. That’s like moving a class average from a C to a solid B.
  • Best overall with Skills: Gemini CLI + Gemini 3 Flash at 48.7% pass rate (top absolute score). Biggest jump: Claude Code + Opus 4.5 with +23.3 points (top improvement).
  • Self-generated Skills: –1.3 points on average (no real help). Only Opus 4.6 shows a small gain (+1.4); Codex + GPT-5.2 drops notably (–5.6). Like trying to write your own textbook during the test—it mostly backfires.
  4. Domain-level results (why they vary):
  • Biggest winners: Healthcare (+51.9) and Manufacturing (+41.9). These domains need exact procedures (unit harmonization, job-shop scheduling) that Skills spell out.
  • Moderate lifts: Cybersecurity (+23.2), Natural Science (+21.9), Energy (+17.9), Office (+17.8), Finance (+15.1), Media (+13.9).
  • Small lifts: Robotics (+7.0), Mathematics (+6.0), Software Engineering (+4.5). These may already be covered well in pretraining or are less about brittle procedures.
  5. Task-level surprises:
  • Huge beneficiaries: 'mario-coin-counting', 'sales-pivot-analysis', 'flood-risk-analysis', 'sec-financial-report' (some +70 to +86 points). Clear, specialized procedures unlock success.
  • Negative deltas (16 of 84 tasks): 'taxonomy-tree-merge' (–39.3), 'energy-ac-optimal-power-flow' (–14.3), 'trend-anomaly-causal-inference' (–12.9), 'exoplanet-detection-period' (–11.4). Likely due to conflicting or overly heavy Skills causing confusion or missteps.
  6. Skill design factors:
  • Quantity: 2–3 Skills are best (+18.6 points). 1 Skill still helps (+17.8). 4+ Skills drop to +5.9—information overload.
  • Complexity: Detailed or compact Skills perform best (+18.8 and +17.1). 'Comprehensive' Skills actually hurt (–2.9)—too long, too much to sift.
  7. Model scale trade-offs:
  • Smaller + Skills vs bigger without: Claude Haiku 4.5 with Skills (27.7%) beats Opus 4.5 without Skills (22.0%). Skills can partially substitute for size.
  8. Cost–performance notes (from token/cost analysis where available):
  • Skills raise input tokens modestly (6–13%), but the pass-rate gains are much larger—a strong trade.
  • Gemini 3 Flash uses more tokens than Pro but is cheaper per token, so with Skills it achieves a better cost per solved task and sits on the improved Pareto frontier.
  9. Unexpected findings:
  • Self-generated Skills weren’t a free win; they often distracted agents or stayed too vague ('use pandas') without concrete API steps.
  • Some harnesses admitted Skills existed but then ignored them, showing that integration quality matters as much as Skill quality.
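The cost-per-solved-task comparison above reduces to simple arithmetic: divide average spend per attempt by the fraction of attempts that succeed. The dollar figures below are made-up placeholders, not numbers from the paper; the point is only that a cheaper model can win once you normalize by tasks actually solved:

```python
def cost_per_solved_task(cost_per_task_usd: float, pass_rate: float) -> float:
    """Expected spend per successfully solved task.

    pass_rate is a fraction in (0, 1]; cost is average spend per attempt.
    """
    if pass_rate <= 0:
        return float("inf")  # never solves anything: infinite cost per success
    return cost_per_task_usd / pass_rate


# Hypothetical prices; pass rates echo the small-vs-big comparison above.
big_model = cost_per_solved_task(cost_per_task_usd=0.50, pass_rate=0.22)
small_plus_skills = cost_per_solved_task(cost_per_task_usd=0.10, pass_rate=0.28)
print(f"big: ${big_model:.2f} per solve, small+skills: ${small_plus_skills:.2f} per solve")
```

Under these assumed prices, the smaller model with Skills costs several times less per solved task, which is the kind of Pareto improvement the token/cost analysis describes.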

Bottom line: Curated, focused Skills produce meaningful, repeatable gains—especially where procedures rule the day. But more and longer isn’t better, and the harness you use changes how much benefit you see.

05Discussion & Limitations

Limitations (honest take):

  • Scope: Tasks are terminal-based and containerized. Results may differ for GUI agents, multi-agent teamwork, or very long workflows.
  • Models/harnesses: Only certain commercial stacks were tested; updates or other systems could behave differently.
  • Context length: Adding Skills adds tokens. While self-generated controls suggest 'structure beats length,' stronger length-matched baselines (e.g., random text, plain docs) would sharpen the causal story.
  • Optimistic skill quality: Benchmark uses high-quality Skills; real-world Skills are often noisier. Gains in practice may be smaller without curation.

Required resources:

  • Dockerized environments, strict verifiers, and access to supported agent harnesses and models. Skills must be authored and placed correctly for the harness to load them.

When NOT to use Skills (or use sparingly):

  • Tasks already within the model’s comfort zone (e.g., simple math or common coding tasks) where Skills add overhead or conflict with strong priors.
  • Situations where too many Skills (4+) or sprawling 'comprehensive' docs risk cognitive overload or context spill, which can reduce pass rates.
  • Harnesses that don’t reliably surface or apply Skills; even great Skills make little difference if never used.

Open questions:

  • Automatic Skill synthesis: Can we learn concise, procedural Skills from demonstrations or codebases without hurting clarity?
  • Skill composition: How do multiple Skills interact—when do they stack and when do they clash? Can we predict composite effects from single-Skill results?
  • Cross-modality: How should Skills look for GUI/vision agents? What’s the right structure for click/drag sequences and visual templates?
  • Better controls: What exact parts of a Skill (steps, examples, code templates) drive most gains? Which minimal combination works best?

Takeaway: Skills close procedural gaps, but they must be short, sharp, and suited to the harness. The benchmark turns this from guesswork into data-backed practice.

06Conclusion & Future Work

Three-sentence summary:

  • SKILLSBENCH fairly measures whether 'how-to' Skill packages help AI agents by running the same realistic tasks with and without Skills under strict, automatic graders.
  • Curated, focused Skills boost performance by +16.2 points on average, especially in procedure-heavy domains, while self-generated Skills don’t help on average and can distract.
  • Less is more: 2–3 compact/detailed Skills beat long, comprehensive docs, and smaller models with good Skills can match or beat larger models without them.

Main achievement:

  • Establishing Skills as first-class, testable artifacts—and proving, with rigorous paired evaluation, which Skill designs work, where they help most, and when they hurt.

Future directions:

  • Learn Skills automatically from demos or code while keeping them concise; study multi-Skill composition and cross-modal Skills for GUI agents; add stronger length-matched controls to isolate structure vs length.

Why remember this:

  • It changes Skills from a 'nice add-on' into an evidence-based tool: you can now pick how many Skills to write, how long they should be, and which domains pay off most—saving money (smaller models), time (fewer retries), and errors (safer procedures) in real-world agent deployments.

Practical Applications

  • Author short, stepwise Skills (2–3 modules) for your top workflows (e.g., data validation, report generation) and measure uplift.
  • Trim 'comprehensive' Skills into concise, action-focused versions with one worked example and a final checklist.
  • Run A/B tests (no Skills vs curated Skills) on your own tasks using deterministic checks to prove value before wide rollout.
  • Pair smaller, cheaper models with strong Skills to hit target accuracy while reducing inference costs.
  • Tune your agent harness to reliably discover and apply Skills (e.g., folder structure, activation prompts, relevance filters).
  • Create domain-specific validation steps inside Skills (unit checks, schema enforcement, tolerance thresholds) to reduce 'almost correct' failures.
  • Identify tasks with low baseline pass rates and strong procedural needs (e.g., file-format brittle tasks) as prime candidates for Skills.
  • Build a Skill library with clear names, frontmatter summaries, and minimal code templates for easy reuse across teams.
  • Periodically review trajectory logs to spot recurring failure modes and add targeted steps to the responsible Skill.
  • Avoid overloading: cap active Skills per task and prefer 'detailed' or 'compact' styles over encyclopedic docs.
#Agent Skills#LLM agents#Benchmarking#Deterministic verifiers#Pass rate#Normalized gain#Procedural knowledge#Agent harness#Containerized evaluation#Skills design#Trajectory analysis#Domain-specific workflows#Context augmentation#Cost-performance tradeoff#Paired evaluation