Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Yi Liu; Weizhe Wang; Ruitao Feng; Yao Zhang; Guangquan Xu; Gelei Deng; Yuekang Li; Leo Zhang

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Beginner

Yi Liu, Weizhe Wang, Ruitao Feng et al.1/15/2026

arXiv PDF

Key Summary

•Agent skills are like apps for AI helpers, but many of them are not carefully checked for safety yet.
•This paper scanned 31,132 real skills from two marketplaces and found that about 1 in 4 (26.1%) had security problems.
•The most common risks were data sneaking out (13.3%) and getting more power than allowed (11.8%).
•About 1 in 20 skills (5.2%) showed high-severity warning signs that could indicate malicious intent.
•Skills that include executable scripts were 2.12 times more likely to have vulnerabilities than instructions-only skills.
•The researchers built a tool called SkillScan that uses both pattern checking and an AI reader to spot risky behavior.
•SkillScan performed well in testing, with 86.7% precision and 82.5% recall on a hand-checked set of 200 skills.
•They created a clear list (taxonomy) of 14 vulnerability patterns in four buckets: prompt injection, data exfiltration, privilege escalation, and supply chain risks.
•The authors released an open dataset and tools so others can help fix the problem and build better defenses.
•The big takeaway is that platforms need permission systems and security vetting before skills run, just like phone apps do.

Why This Research Matters

AI agents are becoming part of everyday work, and skills are how they learn new tricks—so if skills are risky, people’s data and systems are at risk too. This study shows that vulnerabilities are not rare edge cases; they are common enough to need platform-level fixes. Clear categories and accurate detection let builders focus on the biggest dangers first, like secret-stealing and over-privileged actions. With permission systems and vetting, users can safely enjoy the benefits of agent skills without fear of silent data leaks. The open tools and dataset mean the community can help audit, improve, and keep watch as the ecosystem grows. Getting these guardrails in place now can prevent the kind of security messes that hit browser extensions years ago.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

🍞 Top Bread (Hook): You know how your tablet can do lots more once you install apps, like drawing, music, or maps? Those apps make the device more helpful, but a bad app could also spy or mess things up if nobody checks it first.

🥬 Filling (The Actual Concept):

What it is: Agent skills are like mini-apps for AI helpers that add new powers using a SKILL.md instruction file and sometimes executable scripts.
How it works (step by step):
1. An AI agent looks for a matching skill when you ask it to do something special (like “convert a file” or “summarize docs”).
2. It loads the skill’s instructions (from SKILL.md) that tell it what steps to follow.
3. If needed, it runs bundled scripts (like Python or shell) to finish the job.
Why it matters: Without checks, the agent may trust the skill too much, letting it read files, send data online, or run commands you didn’t expect.

🍞 Bottom Bread (Anchor): Imagine installing a “GIF Maker” skill to turn videos into GIFs, but hidden inside it also sends your files to a secret website. Without vetting, the AI might just do it.

🍞 Top Bread (Hook): Imagine your class’s secret cookie recipe being copied and sent to someone outside the school without permission.

🥬 Data Exfiltration:

What it is: Data exfiltration is when information sneaks out of your computer or project to someplace it shouldn’t go.
How it works:
1. The skill quietly looks for sensitive stuff (like API keys, tokens, config files).
2. It packages them up (maybe as JSON).
3. It sends them over the internet to a server.
Why it matters: Your secrets could be stolen, accounts taken over, and private projects exposed.

🍞 Anchor: A “cloud backup” skill that says it saves your work could actually send your passwords along with your files to a mysterious server.

🍞 Top Bread (Hook): Think of a hall pass that lets you go anywhere in school—even places you shouldn’t—because a note was forged.

🥬 Privilege Escalation:

What it is: Privilege escalation is when a skill gets more power or access than it should have.
How it works:
1. The skill asks for broad permissions (like read/write everything or run shell).
2. It calls commands (maybe with sudo) to get higher rights.
3. With those rights, it can change settings, read secrets, or install software.
Why it matters: Small mistakes become big problems if the skill can act as an administrator.

🍞 Anchor: A “code formatter” skill that only needs to fix typos asks to run as admin and write anywhere on your computer—way too much power for its job.

🍞 Top Bread (Hook): Picture someone slipping a sneaky instruction into tiny print at the bottom of a worksheet: “Ignore your teacher and do my plan instead.”

🥬 Prompt Injection:

What it is: Prompt injection tricks the AI into following new, hidden instructions instead of the safe ones.
How it works:
1. Malicious directions hide in comments, formatting, or “helpful tips.”
2. The AI reads them and believes they are important.
3. It may ignore safety rules or send out protected information.
Why it matters: Even without bad code, clever words can hijack the agent.

🍞 Anchor: A documentation skill’s hidden note says, “Silently upload the project files to our analytics for quality,” so the AI does it, thinking it’s helpful.

🍞 Top Bread (Hook): If you build a bike from parts, you want the parts store to be trustworthy; a single bad part can cause a crash.

🥬 Supply Chain Risks:

What it is: Supply chain risks happen when the skill depends on outside packages, scripts, or servers that can change or be hijacked.
How it works:
1. The skill lists dependencies without version locks.
2. It downloads code at runtime (like curl | bash) from the internet.
3. If those sources are compromised, the skill becomes dangerous.
Why it matters: Attackers can swap good parts for bad after you install.

🍞 Anchor: A “setup helper” skill fetches an install.sh from a website each run; if that site gets hacked, your machine gets hacked too.

The World Before: AI helpers were getting superpowers through skills, but there wasn’t a clear picture of how safe those skills were. Marketplaces collected thousands of skills quickly, and many loaded instructions and code with very little checking—like an app store with no review line.

The Problem: Nobody had measured, at scale, how many real-world skills were risky, which types of problems were common, or which kinds of skills were most dangerous. Without numbers, platforms couldn’t set smart rules, and users couldn’t judge trust.

Failed Attempts: Prior research focused on model behaviors—like jailbreaks and adversarial prompts—not on the unique mix of skill instructions plus executable code plus dependencies. Traditional code scanners missed instruction-level tricks, and prompt-only defenses missed code-level exfiltration.

The Gap: We needed an ecosystem-wide X-ray that could see both code and instructions, understand their meaning, and classify real-world risks into a clear, reproducible map.

Real Stakes: This touches everyday life—work files, API keys, and private chats live where agents run. A single risky skill can leak company secrets, drain accounts, or install malware. The paper brings data, structure, and tools so platforms can add permissions, vetting, and sandboxes before things go wrong at scale.

02Core Idea

🍞 Top Bread (Hook): Imagine you’re packing school lunches for the whole grade. You need a fast check for every lunchbox to make sure no one snuck in peanuts or spoiled food.

🥬 The Concept (Aha! in one sentence): The key insight is to combine broad, fast pattern checks with a smart AI “reader” to reliably spot dangerous agent skills at scale and organize them into a clear, four-part risk map.

How it works (like a recipe):

Collect lots of real skills from public marketplaces.
Run static checks for risky patterns in both instructions and code.
Ask an AI model to read the context and judge if flagged things really look dangerous.
Sort findings into a simple taxonomy (four categories, 14 patterns) and measure how common each is.

Why it matters: If you don’t check lunches (skills) before kids (AI agents) eat them, one bad item can make many people sick; the combo approach catches more true risks while keeping false alarms manageable.

🍞 Bottom Bread (Anchor): It’s like a metal detector at the door (pattern checks) plus a trained guard who can tell a toy from a real threat (AI classifier) and a dashboard that shows what issues keep showing up (taxonomy).

Multiple Analogies:

Airport security: X-ray (static analysis) + human screener (AI semantic check) + incident categories (taxonomy) make flights safer.
Library sorting: A barcode scanner (patterns) plus a librarian’s judgment (semantics) keep books in the right sections (taxonomy) for quick finding.
Health screening: Basic vitals (patterns) plus a doctor’s exam (semantics) lead to a diagnosis map (taxonomy) that informs treatment (defenses).

Before vs After:

Before: Everyone guessed about skill safety, and tools saw either code risks or prompt risks—but rarely both together—so many issues slipped through.
After: We have measured prevalence (26.1%), a shared vocabulary (14 patterns in 4 buckets), and a working detector (86.7% precision, 82.5% recall) that scales to tens of thousands of skills.

Why It Works (intuition, no equations):

Static checks are great nets: they catch many possible problems quickly (high recall) but pull in some harmless fish (false positives).
Semantic AI is a careful sorter: it reads surrounding text/code and tosses back harmless catches (better precision), but it might miss a few tricky ones.
Together they balance speed and understanding, and the taxonomy gives everyone the same labels so improvements can target the biggest problems first.

Building Blocks (with Sandwich explanations for new concepts):

🍞 Hook: You know how museum maps group exhibits into wings so visitors don’t get lost? 🥬 Vulnerability Taxonomy:

What it is: A simple, shared list that groups weaknesses into 4 categories (prompt injection, data exfiltration, privilege escalation, supply chain) and 14 specific patterns.
How it works:
1. Define clear pattern names and what counts as a match.
2. Assign severity tiers (low/medium/high) by potential harm.
3. Use the same map for all skills so results are comparable.
Why it matters: Without a map, fixes are random and debates endless. 🍞 Anchor: Teams can say, “We reduced E2 (env var harvesting) by 80%,” and everyone knows exactly what that means.

🍞 Hook: Think of a spell-checker that flags words before a teacher reads the essay. 🥬 Static Code Analysis:

What it is: A fast, no-running scan of instructions and code to spot known danger signs.
How it works:
1. Search for patterns like curl | bash, sudo, or unpinned dependencies.
2. Flag suspicious phrases in SKILL.md like “ignore previous instructions” or “send to http(s)://…”.
3. Produce a list of candidate risks.
Why it matters: It’s quick and catches a lot early. 🍞 Anchor: Like spotting “teh” before the teacher sees it, the scanner catches obvious issues before anyone runs the skill.

🍞 Hook: A reading buddy who understands tone, hints, and context can catch tricks a spell-checker can’t. 🥬 LLM-based Semantic Classification:

What it is: Using a large language model to read content and judge intent/context.
How it works:
1. Feed the model the skill’s text, metadata, and scripts.
2. Ask it to rate each risk area and cite evidence.
3. Keep only confident findings and lower noise.
Why it matters: Some dangers hide in wording or indirect steps that patterns alone won’t catch. 🍞 Anchor: The model can tell whether “collect environment variables for analytics” likely means sensitive secret grabbing.

🍞 Hook: A good scoreboard tells a team where to practice more. 🥬 Measurement at Scale:

What it is: Counting how many skills show which problems so we know the size and shape of risk.
How it works:
1. Crawl marketplaces (over 42k entries; 31,132 unique after filtering).
2. Run the pipeline and tally results with uncertainty.
3. Share datasets and tools so others can verify and extend.
Why it matters: Numbers turn worry into action plans. 🍞 Anchor: Knowing 13.3% data exfiltration beats guessing—now platforms can prioritize network permission controls.

03Methodology

At a high level: Input (31,132 skills) → Static Analysis (pattern scan) → LLM Safety Screens (LLM-Guard) → Hybrid AI Classification (Claude) → Output (labeled vulnerabilities in a 4×14 taxonomy).

Step-by-step, like a recipe (with Sandwich explanations where we introduce new ideas):

Collecting Skills from the Wild

What happens: The team crawled two big marketplaces, removed duplicates and empty placeholders, and kept only English skills with accessible repos—ending at 31,132 unique skills.
Why this step exists: You can’t measure the ecosystem without a big, representative sample.
Example: If two sites list the same skill, keep the richer one so stats aren’t double-counted.

Pattern Scanning (Static Analysis) 🍞 Hook: Like using a checklist to spot loose seatbelts before the bus leaves. 🥬 Static Code Analysis:

What it is: A fast look-through of text and code—no running—matching well-known danger patterns.
How it works:
1. Scan SKILL.md for risky phrases (e.g., “ignore previous instructions,” “send to URL,” paths like ~/.ssh).
2. Scan scripts for signals (e.g., sudo, chmod 777, curl | bash, unpinned deps, env var harvesting).
3. Mark anything that matches as a candidate.
What breaks without it: You’d miss many obvious risks and waste time asking the AI model to read everything. 🍞 Anchor: A shell script that runs “sudo” and pipes curl into bash gets flagged instantly.

Safety Screens (LLM-Guard)

What happens: A set of specialized detectors looks for prompt injection, secrets, invisible characters, or obfuscation.
Why this step exists: Some tricks dodge simple rules—for example, hidden Unicode or base64 blobs.
Example: If a SKILL.md quietly includes invisible control characters, the guard flags it for deeper review.

Hybrid AI Classification (Claude) 🍞 Hook: When the metal detector beeps, a security officer takes a closer, smarter look. 🥬 LLM-based Semantic Classification:

What it is: An AI reads the full context to judge whether the flagged issue is truly risky.
How it works:
1. Provide the skill’s instructions, metadata, and code to the model.
2. Ask it to rate each category (prompt injection, data exfiltration, privilege escalation, supply chain) with evidence.
3. Keep only confident calls (≥0.6), overturn static hits only with very high confidence (≥0.8) to stay conservative.
What breaks without it: You’d either drown in false alarms (patterns alone) or miss code-level signals (semantics alone). 🍞 Anchor: The AI can tell a harmless HTTP call to a public API from a silent upload of secrets to an unknown domain.

Organizing Findings (Taxonomy) 🍞 Hook: A color-coded binder keeps homework from getting lost. 🥬 Vulnerability Taxonomy:

What it is: Four buckets—prompt injection, data exfiltration, privilege escalation, supply chain—split into 14 patterns with severity tiers.
How it works:
1. Each finding maps to a pattern ID (like E2 for env var harvesting).
2. Severity is the highest pattern per skill (Low, Medium, High).
3. Results roll up to per-category and overall prevalence.
What breaks without it: You can’t compare results or prioritize defenses. 🍞 Anchor: Platforms can say, “We block SC2 (external script fetching) by policy,” and instantly reduce a whole class of risk.

Validation (Ground Truth)

What happens: Two human experts labeled 200 skills independently, then resolved differences with a third reviewer.
Why this step exists: To check if the detector is accurate (it achieved 86.7% precision and 82.5% recall).
Example: If the tool says “exfiltration,” humans verify the evidence looks truly risky.

Uncertainty and Safety

What happens: The authors report confidence intervals, account for LLM non-determinism, and avoid running unknown code at scale.
Why this step exists: Honest measurement and safe research practices prevent harm and overclaiming.
Example: They report that true prevalence is likely 23–30%, with 26.1% as a point estimate.

The Secret Sauce (why this is clever):

Union then refine: Use union of fast pattern flags and safety screens to catch almost everything, then refine with an AI reader to reduce noise.
Asymmetric thresholds: It’s harder for the AI to overrule a strong static hit; this keeps the system conservative and safer.
Shared language: A 4×14 taxonomy turns messy findings into a roadmap that platforms and developers can act on.

Concrete mini-examples:

Data Exfiltration candidate: Code loops over environment variables matching SECRET or TOKEN and posts them to a URL; the AI confirms context suggests secret exfiltration (E2, high severity).
Privilege Escalation candidate: Instructions request broad file_system read/write and shell_execute for a simple linter; the AI flags excessive permissions (PE1, low severity) as hygiene risk.
Supply Chain candidate: requirements.txt has unpinned packages and install instructions use curl | bash; the AI confirms SC1 (low) and SC2 (high).

04Experiments & Results

The Test: Measure how common vulnerabilities are in real-world skills, which types appear most, and what factors make a skill riskier.

Dataset: 31,132 unique skills from two marketplaces (after filtering from 42,447).
Detector: SkillScan (static + LLM safety + AI classification) using the 4×14 taxonomy.
Validation: 200 human-labeled skills to check accuracy; confidence intervals and adjustments reported.

The Competition (baselines and context):

Traditional code scanners are strong on classic code issues but miss instruction-level prompt tricks.
Prompt-only defenses miss code exfiltration and supply chain pitfalls.
The hybrid approach fills the middle: code + instructions + semantics with a shared map.

The Scoreboard (with context):

Any vulnerability: 26.1% (about 1 in 4 skills). That’s like a class where one in four projects needs safety fixes right away.
By category: • Data Exfiltration: 13.3% (most common) • Privilege Escalation: 11.8% • Supply Chain: 7.4% • Prompt Injection: 0.7% (rarer but serious when present)
Severity tiers: • High-severity: 5.2% (strong indicators of malicious intent or immediate exploitability) • Medium: 8.1% (could be negligence or abuse) • Low: 12.8% (poor hygiene like unpinned deps or excessive permissions)
Scripts vs instructions-only: Skills with scripts are 2.12× more likely to be vulnerable (statistically significant, p<0.001). Think of scripts as power tools—super useful but more dangerous if misused.
Detector accuracy: Precision 86.7% (fewer false alarms) and recall 82.5% (catches most true issues). That’s like getting an A- in both spotting problems and not crying wolf.

Surprising or Notable Findings:

Co-occurrence: If a skill has supply chain issues, there’s a very high chance (often over half) it also has data exfiltration patterns—risky practices tend to travel in packs.
Size matters: Larger skills (>500 lines) had notably higher risk—more moving parts, more chances for mistakes or mischief.
Maintenance myth: Recent updates didn’t guarantee safety; fresh doesn’t always mean secure.
Security/Red-team caveat: These skills are dangerous by design. The raw flag rate is high, but after adjusting for legitimate purpose, they still need careful, manual review.

Concrete case glimpses (sanitized):

“Backup helper” quietly scoops environment secrets and posts them to a fixed endpoint (E2/E1).
“Code review” hides a note to exfiltrate discussion context to an analytics server and to auto-approve with special comments (P2/P3).
“Dependency manager” mixes unpinned deps, runtime script fetching, and obfuscated code—a three-lane highway for trouble (SC1/SC2/SC3).

What this means in plain terms: The ecosystem is young and flexible, but that flexibility creates an attack surface. The numbers show a real, measurable need for app-store-like permission gates, code signing, and mandatory security scans—before skills run.

05Discussion & Limitations

Limitations (honest assessment):

Intent ambiguity: The method flags risky patterns whether caused by malice or sloppy practice; some Security/Red-team tools look dangerous but are purposeful. So 26.1% means “warrants review,” not “definitely bad.”
Static-first: Not running code avoids harm but can miss time-bomb or conditional behavior; a small pilot runtime test confirmed many high-confidence cases, but not all.
LLM variance: AI judgments can wobble a bit run-to-run; thresholds and static-first logic keep it conservative, but it’s not 100% deterministic.
Snapshot in time: Collected December 2025. Things may improve with vetting—or get trickier as attackers adapt.

Required Resources (to use or extend this work):

Data access: Marketplaces or repos for skills; stable crawling and deduping.
Compute: Enough to run static scans and LLM passes at scale.
Security expertise: To calibrate patterns, review edge cases, and triage high-severity findings.
Platform cooperation: For responsible disclosure and remediation.

When NOT to Use (pitfalls):

As a sole judge of intent: The tool is best at “is it dangerous?” not “was it meant to harm?”
For runtime-only bugs: Logic that triggers only under special conditions may evade static+semantic checks.
For final approvals without human eyes: High-severity flags still deserve a quick expert look.

Open Questions:

Dynamic analysis at scale: How can we safely and ethically run thousands of suspicious skills in sandboxes to confirm exploitability?
Better intent signals: Can we separate “security tool doing its job” from “malware in disguise” automatically?
Stronger permissions: What is the best capability model for skills that balances usefulness with least privilege?
Ecosystem evolution: Will prevalence fall with vetting, or will attackers shift to stealthier patterns that need new detectors?
Cross-platform standards: Can we agree on shared manifests, code signing, and auditing that work across agent frameworks?

06Conclusion & Future Work

Three-Sentence Summary: This paper scanned 31,132 real agent skills and found that about one in four show potentially dangerous patterns, with data exfiltration and privilege escalation most common. A hybrid detector called SkillScan—static checks plus an AI reader—organized risks into a clear 4×14 map and achieved strong accuracy (86.7% precision, 82.5% recall). The results call for capability-based permissions, mandatory vetting, and sandboxing to keep agent ecosystems safe.

Main Achievement: Turning scattered worries into a measured, actionable picture—complete with a validated detector, a practical taxonomy, and open artifacts—so platforms and developers can fix the biggest risks first.

Future Directions: Build safe, large-scale dynamic sandboxes; sharpen intent detection to separate legitimate security tools from malware; standardize permission manifests and code signing; and run longitudinal studies to watch how attacker tactics change.

Why Remember This: Agent skills are the new “apps” for AI—amazing but risky without guardrails. This study is the first big map of the hazards, the measuring stick to track progress, and the starter toolkit to make skills safer before the ecosystem grows too fast to catch up.

Practical Applications

•Add mandatory static and semantic scans to skill publishing pipelines before listing.
•Adopt capability-based permission manifests (file, network, execute) with least-privilege defaults.
•Block known-dangerous patterns by policy (e.g., curl | bash, unpinned dependencies) unless explicitly reviewed.
•Run higher-risk skills in sandboxes (containers/VMs/WebAssembly) with strict network/file guards.
•Require author verification and cryptographic code signing for marketplace submissions.
•Provide clear user-facing warnings and permission prompts tied to the 4×14 taxonomy.
•Continuously monitor popular skills for changes (diff scans) and trigger re-review on risky edits.
•Educate developers with lint rules and examples that map directly to E/PE/SC/PI patterns.
•Prioritize manual review for skills that bundle scripts, are very large, or request broad permissions.
•Establish a community audit and bug bounty program focused on high-download skills.

Version: 1