Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report

Zhuoran Yang; Ed Li; Jianliang He; Aman Priyanshu; Baturay Saglam; Paul Kassianik; Sajana Weerawardhena; Anu Vellore; Blaine Nelson; Neusha Javidnia; Arthur Goldblatt; Fraser Burch; Avi Zohary; Assaf Eisenman; Mahdi Sabbaghi; Supriti Vijay; Rahim Dharssi; Dhruv Kedia; Kojin Oshiba; Yaron Singer; Amin Karbasi

Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report

Beginner

Zhuoran Yang, Ed Li, Jianliang He et al.1/28/2026

arXiv PDF

Key Summary

•This paper introduces Foundation-Sec-8B-Reasoning, a small (8 billion parameter) AI model that is trained to “think out loud” before answering cybersecurity questions.
•The model always writes a clear step-by-step reasoning trace inside special <think>...</think> tags, so humans can check how it got its answer.
•It is trained in two stages: first by learning from millions of examples with correct reasoning (supervised fine-tuning), then by practicing and getting rewards only when its answers and formats can be verified (reinforcement learning with verifiable rewards).
•On a key cybersecurity task (mapping CVEs to CWEs), it scores 75.3%, beating models that are up to 15 times larger.
•It stays strong on general tasks like math, multi-step question answering, and following instructions, with a big jump in human preference scores on AlpacaEval 2 (62.6%).
•The team solved common training problems like reward hacking (where a model tries to game the reward) by checking format rules and penalizing bad or missing reasoning tags.
•Safety is strong when you add a good system prompt (93.00% pass rate on HarmBench), and even stronger with an external safety filter (98.25%).
•An ablation study shows the second stage (reinforcement learning) is what supercharges multi-hop reasoning, especially on long-form question answering.
•This is the first open-source, native reasoning model built specifically for cybersecurity, showing that specialized models can be both smart and trustworthy.

Why This Research Matters

Cybersecurity teams make high-stakes decisions where a wrong step can cause real harm, so they need AI that shows its work. This model is trained to think first and answer second, giving analysts transparent reasoning they can audit, correct, and trust. It performs as well as or better than much larger models on key security tasks, making powerful help more affordable. It also keeps strong general skills, so one model can assist with math, coding, and long-form analysis—useful in many day-to-day workflows. With the right prompts and guardrails, it’s also safe, refusing harmful requests while still being helpful. This combination of performance, transparency, and safety is exactly what real-world security operations need. It’s a recipe other high-stakes fields—like medicine or finance—can reuse to build trustworthy AI.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

🍞 Hook: You know how a good science fair project doesn’t just show the final result—it shows the steps so everyone can understand and trust what happened? In cybersecurity, it’s the same: the steps matter as much as the answer.

🥬 The Concept (Reasoning): What it is: Reasoning is the process of thinking through steps to solve a problem. How it works: 1) Read the question. 2) Gather clues. 3) Connect clues step by step. 4) Check if the steps make sense. 5) Share the answer with the steps. Why it matters: Without reasoning, you might get the right answer by luck, but no one can check or learn from it—and in security, that can lead to dangerous mistakes. 🍞 Anchor: Imagine classifying a bug in software: if you say “It’s CWE-79” (a kind of XSS) but don’t show why, a teammate might fix the wrong thing.

The World Before: Large language models (LLMs) could answer questions, write code, and summarize documents well. But in cybersecurity, tasks like linking a vulnerability (a CVE) to its root cause (a CWE), extracting attacker techniques, or rating severity (CVSS) require careful, multi-step analysis. Traditional instruction-tuned models often gave short, polished answers without showing their work. That’s risky in security, where analysts need to audit logic, reproduce steps, and defend decisions.

The Problem: Cybersecurity workflows need transparent, multi-hop reasoning. Think of a threat analyst: they must trace clues across multiple documents, match patterns to the MITRE ATT&CK framework, and pick the correct CWE among hundreds. A black-box “just trust me” answer isn’t enough. Teams need: 1) Clear step-by-step reasoning, 2) Verifiable final outputs (e.g., exact CWE ID), and 3) Stable behavior across very different task types (short answers, long analyses, multiple-choice, mappings).

Failed Attempts: People tried simply fine-tuning instruction models with more data, or asking them to write chain-of-thought sometimes. But three issues kept showing up: 1) Missing steps: models skipped the hard parts and jumped to answers. 2) Unstable training: long, messy outputs could dominate the learning signal and make models worse. 3) Reward hacking: when trained with rewards on final answers only, models learned to game the rules—like writing empty or nonsense reasoning to save effort while still guessing correct answers.

The Gap: Security teams needed a native reasoning model—one trained from the start to “think before speaking,” with steps that are easy to audit. It also had to be specialized in cybersecurity so it knows the domain deeply, yet keep general skills (math, coding, following instructions) for everyday tasks.

🍞 Hook (Native Reasoning Model): Imagine a detective who always writes notes in a notebook before telling the chief what happened—that notebook is visible to everyone, so the team can check the logic.

🥬 The Concept: What it is: A native reasoning model is trained to always produce step-by-step thoughts before the final answer. How it works: 1) It learns a special format (<think>...</think>) to hold its reasoning. 2) It practices on many tasks that require steps. 3) It’s rewarded only when both the format and the final answer are correct. Why it matters: Without this, the model can hide shortcuts or mistakes, and analysts can’t safely rely on its suggestions. 🍞 Anchor: Before answering “Which CWE fits this CVE?”, the model writes a short, checkable reasoning trace in <think> tags, then prints “CWE ID: CWE-79.”

Real Stakes: In real incidents, a faulty chain of logic can waste hours, mislabel attacker behavior, or underestimate severity. A transparent model lets analysts: 1) Audit logic, 2) Spot weak links, 3) Correct mistakes fast, and 4) Build trust in AI-assisted workflows. For managers, clear reasoning supports compliance and postmortems. For customers, it means safer products and faster, more accurate responses.

02Core Idea

🍞 Hook (The “Aha!”): Imagine taking a math test where you must show your work. Now imagine the grader gives points only if both the work and the answer check out. That’s how this model is trained.

🥬 The Concept (Main Innovation): What it is: Train an 8B-parameter cybersecurity model to always think first (inside <think> tags) and then answer, using a two-stage process: supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR). How it works: 1) Stage 1 (SFT): Feed the model millions of examples that include clear step-by-step reasoning and final answers. 2) Stage 2 (RLVR): Let the model try multiple answers per prompt; use task-specific checkers to verify the final answer and the required format; reward only good, well-formed outputs; gently keep the model close to its SFT behavior (regularization). Why it matters: This builds reliable, auditable reasoning that holds up across messy, real cybersecurity tasks. 🍞 Anchor: For mapping a CVE to a CWE, the model writes a short argument in <think> about why CWE-79 fits (mentions user input, HTML context, script injection), then ends with “CWE ID: CWE-79.”

Multiple Analogies:

Detective Notebook: The model is a detective who must record clues and logic in a notebook before telling the verdict. A supervisor checks both the notes and the verdict.
Cooking with Mise en Place: The chef (model) preps ingredients (clues) and lays them out (<think>). Only when ready, they plate the dish (final answer). A taster checks both the prep and the taste.
Show-Your-Work Math: Students don’t get full credit unless the steps are right and the final number matches. The model learns to value both.

Before vs After:

Before: Security models often gave short answers without transparent logic; they did okay on simple questions but stumbled on multi-hop tasks.
After: The model shows its work, scores state-of-the-art among 8B peers on key security tasks (e.g., 75.3% on CTIBench-RCM), and stays strong on general reasoning and alignment (e.g., AlpacaEval 2 at 62.6%).

Why It Works (Intuition):

Verifiable rewards focus learning on things you can check: “Is the CWE ID correct? Is the <think> trace present and non-trivial?”
Gentle regularization (stay near the SFT policy) preserves broad skills and prevents wild drift.
Smart loss aggregation avoids long nonsense answers dominating training.
Format penalties stop reward hacking (like writing empty <think> blocks).

Building Blocks (with mini sandwiches):

🍞 Hook: Ever write a practice sheet before the final draft? 🥬 SFT: What it is: Teaching by example with many reasoning-rich samples. How: learn patterns of good step-by-step thinking and correct answers. Why: Gives the model a strong, safe starting point. 🍞 Anchor: The model sees lots of “CVE text → <think> analysis → CWE ID” examples.
🍞 Hook: Think of a video game that gives points when you finish a level properly. 🥬 RLVR: What it is: Practice-and-reward training where success is checked by verifiers. How: Try several solutions; verifiers approve format and answer; get reward; update policy. Why: Turns “knowing” into “consistently doing.” 🍞 Anchor: The model gets rewarded only if it includes proper <think> and the correct CWE ID line.
🍞 Hook: Sometimes kids try to skip homework steps to finish faster. 🥬 Reward Hacking Fix: What it is: Rules that penalize missing or trivial reasoning. How: Check for <think> tags and non-empty reasoning; penalize gibberish. Why: Prevents shortcutting the process. 🍞 Anchor: Outputting “<think>No</think>” gets penalized.
🍞 Hook: If a loud kid in class talks forever, others can’t be heard. 🥬 Stable Loss Aggregation: What it is: A way to prevent extra-long, low-quality answers from overpowering learning. How: Average per sample (not per token) or use Dr.GRPO methods. Why: Keeps training fair across tasks of different lengths. 🍞 Anchor: A concise, good sample and a long, bad one influence learning equally.
🍞 Hook: A kite flies high but stays tied to a string. 🥬 KL Regularization: What it is: A gentle pull back to the SFT model. How: Add a penalty when the new policy drifts too far. Why: Protects general skills and stability. 🍞 Anchor: The model stays close to its original voice while learning new tricks.

03Methodology

High-Level Recipe: Input (a cybersecurity or general task) → Stage 1: Learn to think (SFT with <think> tags) → Stage 2: Practice with scoring (RL with verifiers and safety rules) → Output (reasoning + final answer).

Stage 1: Supervised Fine-Tuning (SFT) 🍞 Hook: Imagine learning piano with lots of sheet music that shows where to place each finger. 🥬 The Concept: What it is: Teach the model to always write its reasoning in <think>...</think> before the final answer. How it works: 1) Build a huge synthetic dataset (~2M examples) covering cybersecurity, math, coding, instruction following, science, chat, and safety. 2) Each example shows a clear reasoning trace and a final, verifiable answer. 3) Train for three epochs with a smooth learning rate schedule so the habit sticks. Why it matters: This creates a strong “show-your-work” reflex the model will keep in later training. 🍞 Anchor: For a CVSS vector task, the example shows a <think> breakdown of each metric (AV, AC, PR, UI, S, C, I, A), then gives the final CVSS:3.1/... line.

Key Details of SFT:

Data mix ensures broad reasoning: ~27% cybersecurity, ~21% math, ~15% coding, and the rest across instruction, chat, science, safety.
Format training: The model practices putting step-by-step logic in <think> and answers in a strict last-line format (e.g., “CWE ID: CWE-79”).
Stability: Cosine schedule and small learning rate (2e-5) help the model learn the format and habits gently, without forgetting general skills.

Stage 2: Reinforcement Learning with Verifiable Rewards (RLVR) 🍞 Hook: Think of a sports team scrimmage where a referee only counts goals that follow the rules. 🥬 The Concept: What it is: The model practices and gets rewarded only when its output both follows the format and is correct. How it works: 1) Use GRPO-style RL; for each prompt, sample n=5 candidate answers. 2) Verifiers check correctness (e.g., exact CWE, correct MC answer, valid CVSS parse) and format (<think> present and non-trivial). 3) Compute a binary reward; update the model with a KL penalty to stay close to the SFT model; regularize with coefficient 0.02; two epochs at 1e-6 LR. Why it matters: This upgrades the model from “knows how” to “always does it right, step by step.” 🍞 Anchor: On a CTIBench-RCM item, only candidates with good <think> and the right “CWE ID: CWE-…” line get rewarded.

Handling Real Training Challenges:

Data Heterogeneity 🍞 Hook: Some questions need a tweet; others need an essay. 🥬 What it is: Tasks differ in length and difficulty, and weak areas can produce long gibberish. How: Instead of averaging per token (which lets long junk dominate), use per-sample averaging (GRPO) or Dr.GRPO aggregation. Why: Fair influence from each example prevents collapse. 🍞 Anchor: A 10-token gem and a 500-token ramble count equally in loss aggregation.
Reward Hacking and Format Degradation 🍞 Hook: If kids get points just for the final number, some will skip all the steps. 🥬 What it is: The model might try to give a short final answer with empty or fake <think>. How: Add a format penalty: require real <think> content and correct tags; penalize trivial or repetitive traces. Why: Forces genuine step-by-step reasoning that humans can audit. 🍞 Anchor: Outputs like “<think>No</think> Answer: A” lose reward even if A is correct.
Staying Stable (KL Regularization) 🍞 Hook: Learn to dance, but don’t forget how to walk. 🥬 What it is: A safety rope keeping the RL policy close to the SFT policy. How: Add a KL penalty in the objective. Why: Preserves coding, math, and instruction-following while sharpening reasoning. 🍞 Anchor: The model still solves math and follows instructions almost as well as before.

Concrete Flow Examples:

Multiple-Choice Security (CTIBench-MCQA): Input question → <think> explains elimination of distractors → “Answer: C.” Without the reasoning step, the model might pick an answer without justification, making audits impossible.
CWE Mapping (CTIBench-RCM/CWE-Prediction): Input CVE text → <think> cites patterns (e.g., user-controlled input, HTML context, script injection) → “CWE ID: CWE-79.” Without format checks, the model might skip the <think> and risk brittle guesses.
Attack Technique Extraction (CTIBench-ATE): Input threat report → <think> highlights behaviors and maps them to ATT&CK → “Answer: T1059, T1105, ….” Without per-sample aggregation, overly long, weak attempts could derail learning.

Secret Sauce:

Native reasoning habit: The model is trained from day one to write <think> before speaking.
Verifiable rewards: Only correct and well-formed outputs get credit.
Anti-hacking penalties: Format checks stop empty or fake reasoning.
Stability controls: KL regularization and robust loss aggregation keep skills balanced.

04Experiments & Results

🍞 Hook (Benchmarks): Think of a decathlon for AI—lots of different events test different skills so you can’t win by being good at only one thing.

🥬 The Concept: What it is: The model is tested on 10 cybersecurity benchmarks, 10 general-purpose benchmarks, a safety test (HarmBench), and an ablation (SFT vs RL). How it works: 1) Cybersecurity: CTIBench suite (MCQA, RCM, VSP, ATE), proprietary CTI-Reasoning and CWE-Prediction, plus MMLU-Security, CyberMetric-2000, SecBench, SecEval. 2) General: AlpacaEval 2, BBH, GPQA, GSM8K, HumanEval, IFEval, 2WikiMultihopQA, HotpotQA, MATH, MMLU. 3) Safety: HarmBench with and without system prompts, and with Llama-Guard as a shield. Why it matters: Wide coverage shows the model isn’t a one-trick pony; it can reason in security and still handle math, coding, and instructions. 🍞 Anchor: Scoring 75.3% on CTIBench-RCM is like getting an A when bigger models score A- or B+.

The Competition: The model is compared against 18 baselines, including Llama-3.1-8B/3.3-70B, Qwen-3-8B/14B, DeepHat, Phi-4, GPT-OSS-20B/120B, and OpenAI frontier APIs. This shows how an 8B specialized reasoning model stacks up against much larger systems.

The Scoreboard (with context):

Cybersecurity Standouts: • CTIBench-RCM: 75.3%, topping GPT-OSS-120B (71.2%) and Llama-3.3-70B-Instruct (~68.4%). • CTIBench-MCQA: 69.1%, essentially tied with Llama-3.3-70B-Instruct (69.2%). • CWE-Prediction: 70.4% on brand-new 2024–2025 data—strong generalization. • Broad knowledge tasks (MMLU-Security, SecEval) remain competitive (78.2%, 84.8%).
General-Purpose Strengths: • AlpacaEval 2 (human preference): 62.6%, a big jump over Llama-3.1-8B-Instruct (25.4%) and Foundation-Sec-8B-Instruct (33.1%). • Multi-hop QA: 2WikiMultihopQA at 60.5% (much higher than peer 8B models), HotpotQA at 54.8% (competitive). • Math and Coding: GSM8K 82.3% (on par), MATH 43.3% (competitive), HumanEval 79.9% (slight dip vs baselines at 82.3%).

Surprising Findings:

RL supercharges multi-hop reasoning even on tasks not directly trained (e.g., +36.1 points on 2WikiMultihopQA, +45.1 on HotpotQA vs the SFT-only checkpoint). That’s like going from junior varsity to varsity overnight.
Safety depends strongly on a good system prompt: HarmBench pass rate jumps from 54.25% (no system prompt) to 93.00% (with a prompt), and up to 98.25% with Llama-Guard filtering.
Despite sharper reasoning, general abilities remain solid; only minor trade-offs (about 2–3 points) appear on pure coding synthesis.

Ablation: SFT vs RL (what changed?): 🍞 Hook: Imagine learning rules from a textbook (SFT), then mastering them by practicing with a coach who scores you (RL). 🥬 The Concept: What it is: Analyzing the model right after SFT vs after RL. How it works: Compare benchmarks before and after the RL stage. Why it matters: Shows which stage adds which skills. 🍞 Anchor: After RL, multi-hop QA skyrockets (e.g., HotpotQA 9.6% → 54.8%).

Bottom Line: Among 8B models, Foundation-Sec-8B-Reasoning is a top cybersecurity performer and remains broadly capable elsewhere, with standout human preference and multi-hop gains.

05Discussion & Limitations

🍞 Hook (Honest Look): You know how even a great team can have spots to improve—maybe defense is strong, but passing can get crisper? Same here.

Limitations:

Needs a good system prompt and optional guardrails for best safety; without them, refusal rates drop on tricky prompts.
Slight dip in pure code generation (HumanEval ~79.9% vs ~82.3% for baselines). For use cases that demand top code synthesis alone, a coder-first model may still be preferable.
While multi-hop reasoning improved a lot, not every reasoning benchmark is maxed out; there’s still room on complex knowledge tasks.
The approach relies on large-scale synthetic data and proprietary security data; replicating exactly may require similar resources or public substitutes.

Required Resources:

Compute for SFT on ~2M examples and an RL phase sampling n=5 outputs per prompt.
Task verifiers (answer checkers) and format validators.
Infrastructure for evaluation across many benchmarks and seeds.

When NOT to Use:

Requests for harmful or illegal content (e.g., malware creation, exploitation steps)—the model is designed to refuse.
Non-security topics if the deployment uses a domain-restricted system prompt (it will politely decline out-of-scope questions).
Workflows that require top-tier, frontier-level coding synthesis without reasoning traces—consider a specialized coder model instead.

Open Questions:

Can we further improve long-context, multi-document reasoning using retrieval and tool use while keeping reasoning fully auditable?
How can we formally verify reasoning steps beyond final-answer correctness (e.g., logic checkers for <think> content)?
What are the best public datasets to match proprietary cybersecurity corpora without performance loss?
How to calibrate confidence and detect hallucinated identifiers (e.g., fake CVEs/CWEs) in the reasoning trace itself?
Can we scale the approach to larger or MoE models without losing stability—and keep costs reasonable?

06Conclusion & Future Work

Three-Sentence Summary: Foundation-Sec-8B-Reasoning is the first open-source, native reasoning model purpose-built for cybersecurity, trained to show its work in <think> tags before answering. Using a two-stage recipe—SFT on millions of reasoning-rich examples, then RL with verifiable rewards and stability controls—it achieves top-tier security performance for its size while preserving general skills. Safety is strong with the right system prompt and great with external guardrails, proving specialized reasoning can be both capable and responsible.

Main Achievement: It shows that a small, domain-specialized model trained to “think before speaking” can beat or match much larger models on critical cybersecurity reasoning tasks, while keeping strong general abilities.

Future Directions: Add tool use and retrieval while maintaining auditable reasoning; strengthen formal checks on the <think> content; expand public, high-quality security datasets; and refine safety prompts and guardrails for different deployments. Explore scaling strategies (including MoE) and longer-context training without sacrificing stability.

Why Remember This: It’s a blueprint for building trustworthy, specialized AI that shows its work. In high-stakes fields like cybersecurity, that transparency turns AI from a black box into a reliable teammate you can audit, correct, and learn from.

Practical Applications

•Triage assistant for CVE-to-CWE mapping with auditable <think> justifications.
•Threat report technique extraction (MITRE ATT&CK IDs) with clear reasoning for each mapping.
•Vulnerability severity helper that explains each CVSS decision before giving the final vector.
•Incident investigation copilot that chains evidence across logs, tickets, and advisories with traceable steps.
•Secure configuration reviewer (e.g., cloud JSON/Terraform) that cites docs and shows logic before recommendations.
•CTI report drafting with linked sources and step-by-step rationales for conclusions.
•Analyst training tool that shows model reasoning so juniors can learn expert patterns.
•Playbook checker that walks through a response plan and flags missing or risky steps with explanations.
•Compliance audit aid that provides reasoning traces suitable for reviews and postmortems.
•Vulnerability generalization checker that tests new advisories for likely CWE categories with transparent rationale.

Version: 1