CL-bench: A Benchmark for Context Learning
Key Summary
- CL-bench is a new test that checks whether AI can truly learn new things from the information you give it right now, not just from what it memorized before.
- It includes 500 real-world style contexts, 1,899 tasks, and 31,607 checkable rubrics written by experts.
- Every task is solvable using only the provided context, so no web search is needed and pretraining knowledge can't save the day.
- Top frontier models only solved about 17.2% on average, and the best model reached 23.7%, showing big room for improvement.
- The benchmark covers four types of learning: domain knowledge reasoning, rule system application, procedural task execution, and empirical discovery and simulation.
- Models struggled most with discovering rules from data or working in simulations (about 11.8% success), which is harder than just applying written rules.
- Most failures came from ignoring or misusing the given context, and many answers broke simple format rules even when the knowledge was right.
- Longer inputs made things much harder: performance fell as contexts grew from a few thousand tokens to 32K+.
- Extra "thinking" helped a bit for some models, but not always; newer versions weren't automatically better at context learning.
- CL-bench provides a clean, realistic way to measure and improve the core skill of learning from context, which is essential for real-world AI assistants.
Why This Research Matters
In the real world, we constantly face new rules, tools, and data, and we succeed by learning from what's in front of us. CL-bench measures whether AI can do the same: read fresh material, understand it, and use it correctly. This is key for safe operations like following updated safety procedures, complying with changing laws, or fixing new devices. If AI can't learn from context, it guesses from old memories, which leads to mistakes, risks, and wasted time. By revealing where models fail (ignoring details, misusing rules, or breaking formats), CL-bench shows exactly what to improve. As models get better on CL-bench, they become more useful, trustworthy assistants that can adapt to today's instructions, not yesterday's internet.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're joining a new after-school club. They hand you a handbook with brand-new rules no one taught you before. If you can read that handbook, learn the rules, and then play correctly today, you're doing context learning.
Concept 1: Prompt Engineering
- What it is: It's the art of asking AI clear questions so it knows exactly what you want.
- How it works:
- Write a clear instruction (what to do).
- Add any needed details (style, steps, length).
- Check that the answer matches your instruction.
- Why it matters: Without good prompts, AI guesses and often wanders off-topic. Anchor: Like telling a friend, "Please bring a blue notebook and a pencil to math class today," instead of just saying "Bring stuff."
Concept 2: In-Context Learning (ICL)
- What it is: A model learns a task pattern from a few examples in the prompt (like showing it a couple of "question → answer" pairs).
- How it works:
- Provide 2–5 examples of how to do the task.
- Ask a new question in the same format.
- The model imitates the pattern it saw.
- Why it matters: Without ICL, the model may not know the format or style you expect. Anchor: Show a friend two solved math problems, then give a third one in the same style; your friend copies the pattern (a minimal prompt sketch follows below).
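To make the pattern concrete, here is a minimal sketch of assembling a few-shot prompt; the example pairs are invented for illustration and are not drawn from CL-bench.

```python
# Minimal sketch of an in-context-learning (few-shot) prompt.
# The example pairs below are made up for illustration.
examples = [
    ("Translate to pig latin: hello", "ellohay"),
    ("Translate to pig latin: world", "orldway"),
]

def build_few_shot_prompt(new_question: str) -> str:
    """Show a few solved examples, then ask a new question in the same format."""
    lines = []
    for question, answer in examples:
        lines.append(f"Q: {question}\nA: {answer}")
    # The model is expected to imitate the pattern it just saw.
    lines.append(f"Q: {new_question}\nA:")
    return "\n\n".join(lines)

print(build_few_shot_prompt("Translate to pig latin: bench"))
```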
Concept 3: Long-Context Reasoning
- What it is: Keeping track of lots of information across a long story and using the right parts at the right time.
- How it works:
- Read and store key facts from far-apart places.
- Find what's relevant to the current question.
- Connect the dots without losing track.
- Why it matters: If you forget page 2 while on page 50, answers go wrong. Anchor: Remembering details from early movie scenes to solve a twist at the end.
Concept 4: Context Learning (the star of this paper)
- What it is: The model learns brand-new knowledge given right now in the context and applies it to solve tasks.
- How it works:
- Read new information (rules, manuals, data, laws) in the prompt.
- Extract the specifics that matter.
- Use them to reason and answer, even if they conflict with what it memorized before.
- Why it matters: Real life changes fast: new products, new procedures, new laws. If AI can't learn from the latest context, it gives outdated or wrong answers. Anchor: A robot reading a fresh appliance manual and then fixing your new dishwasher correctly, even though it never saw that model before.
The world before: Most evaluations praised AIs for using pre-trained knowledge plus good prompts (prompt engineering) and simple few-shot examples (ICL). That works when tasks match what the model already "knows." But in the real world, problems often require brand-new, task-specific information: a new programming language, an updated safety checklist, or experimental data that hides a law you must discover.
The problem: Benchmarks rarely test whether models actually learn new knowledge from long, messy, realistic contexts. Many benchmarks mix two difficulties, finding the right context (retrieval) and using it (learning), so we can't tell if failure comes from bad search or poor learning.
Failed attempts: Long-context tests often check reading or retrieval, not deep learning and application. Instruction-following tests check rule obedience, not domain learning. Tool-using tests confound retrieval problems with learning problems.
The gap: We needed a clean test where all the necessary new knowledge is already in the context (no web lookup), crafted by experts to be new or niche, and checked by detailed rubrics. Then we can ask: can the model truly learn from what it just read?
Real stakes: Think troubleshooting a factory robot from a new manual, following a city's latest building code, or discovering a physical law from sensor logs. If AI can't learn from context, it will keep guessing from old memories, which is unsafe and unhelpful. That's why CL-bench matters: it spotlights the missing skill and measures it directly.
02 Core Idea
Hook: You know how a great substitute teacher walks into a classroom, quickly reads the day's plan, and then teaches smoothly, even though it's all new? That's the superpower this paper tests in AI.
Aha! Moment in one sentence: If we want AI to work in the real world, we must measure and improve its ability to learn new knowledge from the context we give it right now, so we built CL-bench to test exactly that.
Concept 5: Benchmark (CL-bench)
- What it is: A big, careful test set that measures how well AIs learn from context.
- How it works:
- Experts craft realistic contexts with new or niche knowledge.
- Tasks require using that context (no outside retrieval needed).
- Detailed rubrics check if answers truly applied the context.
- Why it matters: Without a precise test, we can't tell if models are really learning or just guessing. Anchor: Like a science fair rubric that checks every part of your experiment step by step.
Multiple analogies for the idea:
- Detective analogy: The model must read today's case file (context) and solve the mystery, not rely on old cases in its head.
- Sports playbook analogy: The coach hands a brand-new playbook mid-game; the team must read it and run the plays correctly right away.
- Travel analogy: You land in a new country, read the local rules, and drive safely: no prior knowledge needed, just careful reading and applying.
Before vs. after:
- Before: We mostly tested how well models used pre-trained knowledge with neat prompts and a few examples.
- After: We test whether models can pick up brand-new rules, procedures, or patterns from the context and apply them correctly, which is much closer to real life.
Why it works (intuition, not equations):
- By making the needed knowledge live only inside the given context, we force the model to pay attention to that content rather than its memory.
- By using many small, binary rubrics, we can check the result from multiple angles (facts, calculations, steps, and format).
- By covering four different context types (rules, procedures, domains, and data-driven discovery), we test different "learning muscles," not just reading.
Building blocks of CL-bench:
- Fresh, realistic contexts: product manuals, legal systems (even fictional), technical standards, datasets, simulations.
- Tasks that demand learning: multi-step problems, rule application, procedure execution, or law discoveryâalways solvable from the given text/data.
- Rubric-based verification: many yes/no checks ensure completeness and correctness.
- Difficulty levers: long contexts, multi-turn dependencies, and conflicts with pre-trained knowledge to ensure true learning from the provided material.
- Clean scope: no tool calls or retrieval, so failures reflect learning issues, not search misses.
Concept 6: Task-Level Rubrics
- What it is: A checklist of tiny yes/no questions to judge the answer.
- How it works:
- Experts write multiple binary checks per task.
- A verifying model compares the solution to the rubrics.
- Passing all rubrics = solved; any miss = not solved.
- Why it matters: Without fine-grained checks, we'd miss subtle mistakes or partial solutions that skip key rules. Anchor: Like a baking contest card: "Used the right flour? Yes/No. Baked 30 minutes? Yes/No. Frosting matches recipe? Yes/No." (A minimal scoring sketch follows below.)
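A minimal sketch of this all-or-nothing scoring, assuming the rubric verdicts are already available as yes/no values; the rubric texts below are invented for illustration.

```python
# Sketch of task-level rubric scoring: a task counts as solved only if
# every binary rubric check passes. The rubric texts are illustrative.
rubric_results = {
    "Used only functions documented in the provided SDK": True,
    "Performed the required safety check before launch": True,
    "Output follows the documented pseudocode format": False,  # one miss...
}

def is_solved(results: dict[str, bool]) -> bool:
    """All rubrics must be satisfied; any single 'No' means the task is not solved."""
    return all(results.values())

print(is_solved(rubric_results))  # False: the format rubric failed
```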
Concept 7: Domain Knowledge Reasoning
- What it is: Learning expert knowledge from a provided text and using it to reason (e.g., finance, healthcare, legal advice).
- How it works:
- Read the expert material in context.
- Identify relevant facts and principles.
- Apply them carefully to the case.
- Why it matters: Without it, models give generic or unsafe advice. Anchor: Reading a hospital's new dosage guideline and then recommending the right, safe dose for a specific patient case.
Concept 8: Rule System Application
- What it is: Learning a brand-new formal rule set (game mechanics, math system, programming syntax, or regulations) and applying it.
- How it works:
- Extract exact rules and constraints.
- Translate the task into rule steps.
- Check for compliance before finalizing.
- Why it matters: Without this, models break rules or mix in old habits. Anchor: Learning a custom board game from its booklet and then making only legal moves.
Concept 9: Procedural Task Execution
- What it is: Following step-by-step procedures from manuals or workflows to complete operations.
- How it works:
- Find the right procedure in the document.
- Execute steps in order with correct parameters.
- Respect safety and formatting rules.
- Why it matters: Skipping one step can cause errors or hazards. Anchor: Turning a drone delivery request into exact, documented pseudocode with the right safety checks (a small procedure-runner sketch follows below).
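Here is a toy sketch of what "execute steps in order with correct parameters" can look like; the step names and limits are invented for illustration and are not the benchmark's drone SDK.

```python
# Toy sketch of procedural task execution: follow documented steps in order,
# validating parameters before acting. Step names and limits are illustrative.
PROCEDURE = [
    ("check_payload", {"max_weight_kg": 5}),
    ("request_airspace_clearance", {}),
    ("plan_route", {"avoid_no_fly_zones": True}),
    ("launch", {}),
]

def run_procedure(request: dict) -> list[str]:
    if request.get("weight_kg", 0) > PROCEDURE[0][1]["max_weight_kg"]:
        raise ValueError("payload exceeds documented limit")
    log = []
    for step_name, params in PROCEDURE:
        # A real workflow would call the documented function here;
        # this sketch only records that each step ran, in the documented order.
        log.append(f"{step_name}({params})")
    return log

print(run_procedure({"weight_kg": 3.2}))
```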
Concept 10: Empirical Discovery & Simulation
- What it is: Finding patterns or laws from data, or reasoning inside a simulated environment with rules that emerge from evidence.
- How it works:
- Analyze data or environment logs.
- Induce patterns or test hypotheses.
- Apply the discovered law to answer or act.
- Why it matters: Without it, AI can't learn from experiments or adapt to dynamic worlds. Anchor: Noticing helical motion and measuring a pitch angle from physics data to infer a particle's behavior.
Bottom line: CL-bench makes context learning visible and measurable, revealing which models can "read, learn, and do" and which still guess from memory.
03 Methodology
At a high level: System prompt + Context + Task → Model reads and learns new knowledge → Model produces a solution → A verifier checks many rubrics → Pass all = solved.
Step-by-step recipe:
- Expert-built contexts with fresh knowledge
- What happens: Domain experts create 500 realistic contexts (manuals, laws, data logs, simulations, technical standards). The knowledge is new, niche, or intentionally modified so pretraining canât give the answer.
- Why this exists: If knowledge were common on the internet, models might "remember" it instead of learning from the given text.
- Example data: A fictional country's legal system with unique precedents, or a new drone SDK with exact function names and error codes.
- Tasks that force learning from the provided context
- What happens: Experts design 1,899 tasks that rely only on the accompanying contextâno web search, no tools.
- Why this exists: To isolate the learning skill; we want failures to reflect a failure to learn, not missing documents.
- Example: "Convert this urgent hazmat delivery request into valid pseudocode using only documented safety functions."
- Multi-turn sequences and dependencies
- What happens: About half the tasks appear in sequences; later tasks depend on earlier correct answers.
- Why this exists: Real work unfolds over time; we test whether models can keep track and build on what they learned.
- Example: A series of legal decisions where each ruling uses principles established earlier.
- Long, realistic inputs
- What happens: Contexts can be tens of thousands of tokens long, mixing rules, references, and details.
- Why this exists: Real manuals and data logs are long; strong models must keep track across distance.
- Example: API docs spanning navigation, payload, and safety sections for drones.
- Task-level rubrics for rigorous, automatic checking
- What happens: For each task, experts author multiple yes/no checks covering correctness, calculation, judgment, procedure, completeness, and format.
- Why this exists: Complex tasks need multi-angle checking so partial answers don't sneak by.
- Example rubric: "Did the solution call Safety_request_airspace() before launch?"
- LM-as-a-judge verification
- What happens: A strong model (GPT-5.1) acts as the verifier, comparing the solution to rubrics and providing rationales.
- Why this exists: Many answers are open-ended; binary rules alone aren't enough. A capable judge plus detailed rubrics gives reliable scoring.
- Example: Human spot-checks showed over 90% agreement; cross-verifier checks (Claude, Qwen) also showed >90% raw agreement.
- Strict success criterion
- What happens: A task is "solved" only if all rubrics pass.
- Why this exists: Real operations break if any crucial safety or format step is missed.
- Example: A drone plan that forgets airspace permission is unsafe, even if the route is fine.
Concept 11: Contamination-Free Design
- What it is: Ensuring the needed knowledge wasnât learned during pretraining.
- How it works:
- Use fictional creation (new laws, new languages).
- Modify existing content (alter definitions or events).
- Include niche/emerging materials (long-tail docs).
- Why it matters: If the model already "knows" the answer, we can't measure learning from the given context. Anchor: A brand-new board game you've never seen, so you must read the booklet now to play.
Concept 12: Sequential Multi-Turn Tasks
- What it is: Sets of tasks where later ones depend on earlier answers.
- How it works:
- Present Task 1 and solve it using the context.
- Present Task 2 that builds on Task 1.
- Continue the chain, testing memory and consistency.
- Why it matters: Real jobs are chains, not single trivia questions. Anchor: Cooking a meal course-by-course; if you undercook the main dish, dessert timing also goes off. (A minimal multi-turn sketch follows below.)
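A minimal sketch of how such a chain can be run, assuming a hypothetical ask_model wrapper around whatever chat API you use; the point is that each answer is carried into the next turn.

```python
# Sketch of a sequential multi-turn evaluation loop.
# `ask_model` is a hypothetical stand-in for your model client.
def ask_model(messages: list[dict]) -> str:
    raise NotImplementedError("plug in your model client here")

def run_task_sequence(context: str, tasks: list[str]) -> list[str]:
    messages = [{"role": "system", "content": context}]
    answers = []
    for task in tasks:
        messages.append({"role": "user", "content": task})
        answer = ask_model(messages)          # later tasks see earlier answers
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```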
Concept 13: LM-as-a-Judge
- What it is: Using a strong model to check if each rubric is satisfied.
- How it works:
- Feed the solution and rubrics to the verifier.
- The verifier answers each rubric: Yes/No + rationale.
- Aggregate results; all-Yes → solved.
- Why it matters: Some answers need nuanced judgment that simple scripts can't capture. Anchor: A referee reading the rulebook while judging a tricky play in sports. (A hedged judging-loop sketch follows below.)
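A hedged sketch of the judging loop, assuming a hypothetical judge wrapper around the verifier model; the prompt wording is illustrative, not the paper's exact template.

```python
# Sketch of LM-as-a-judge verification: ask the verifier one yes/no question
# per rubric and require all "yes" answers.
# `judge` is a hypothetical wrapper around your verifier model's API.
def judge(prompt: str) -> str:
    raise NotImplementedError("call your verifier model here")

def verify(solution: str, rubrics: list[str]) -> bool:
    for rubric in rubrics:
        prompt = (
            "Solution:\n" + solution
            + "\n\nRubric: " + rubric
            + "\nAnswer 'yes' or 'no', then give a one-sentence rationale."
        )
        verdict = judge(prompt)
        if not verdict.strip().lower().startswith("yes"):
            return False  # a single failed rubric means the task is not solved
    return True
```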
The secret sauce:
- Clean separation: Because all knowledge sits inside the context, this benchmark spotlights context learning (not retrieval).
- Many tiny checks: Rubrics catch subtle misses (like using the wrong function or skipping one safety step).
- Diverse context types: Different skills (deductive rule use vs. inductive discovery) reveal which âlearning musclesâ are weak.
- Realistic difficulty: Long inputs, multi-step dependencies, and formatting constraints feel like real work.
Concrete examples in action:
- Drone SDK case: The model must refuse a non-existent function, follow safety modules, include task parameters (e.g., drone ID D-998, Sector 4), and output in the documented pseudocode format.
- Physics data case: From position/time logs of a charged particle, infer helical motion and compute the pitch angle from the ratio of the parallel and perpendicular velocity components (about 27° in the paper's example), as stated in the context (a worked sketch follows below).
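A rough worked sketch of that physics case, with invented samples rather than the benchmark's data; the exact pitch-angle convention is whatever the provided context defines.

```python
import math

# Toy sketch for the physics-style task: estimate velocity components of a
# helical trajectory from (t, x, y, z) logs, then form a pitch angle from the
# parallel/perpendicular split. The samples and the angle convention here are
# illustrative; in CL-bench the provided context defines the exact formula.
logs = [  # (t, x, y, z) samples of a helix around the z-axis (made up)
    (0.0, 1.00, 0.00, 0.00),
    (0.1, 0.95, 0.31, 0.05),
    (0.2, 0.81, 0.59, 0.10),
]

def velocity(a, b):
    """Finite-difference velocity between two consecutive samples."""
    dt = b[0] - a[0]
    return tuple((b[i] - a[i]) / dt for i in range(1, 4))

vx, vy, vz = velocity(logs[0], logs[1])
v_perp = math.hypot(vx, vy)   # speed in the plane of the circular motion
v_par = abs(vz)               # speed along the helix axis
pitch_deg = math.degrees(math.atan2(v_par, v_perp))
print(f"estimated pitch angle ~ {pitch_deg:.1f} degrees")
```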
Together, these steps turn "learn from what you just read" into a clear, fair, and hard test.
04 Experiments & Results
The test: We measured "task solving rate": does the model pass all rubrics for a task? We looked across four big categories: domain knowledge reasoning, rule system application, procedural task execution, and empirical discovery & simulation.
The competition: Ten frontier models using their official APIs and reasoning settings. The lineup included GPT-5.1, GPT-5.2, o3, Claude Opus 4.5, Kimi K2, HY 2.0, Gemini 3 Pro, Qwen 3 Max, DeepSeek V3.2, and Doubao 1.6.
The scoreboard (with context):
- Overall average: 17.2% solve rate. That's like most students getting a D on a brand-new unit they had to learn from a handout.
- Best model: GPT-5.1 at 23.7% (around a solid C), still far from mastery.
- Category difficulty:
- Domain knowledge reasoning and procedural tasks were relatively more approachable (top models around low- to mid-20s%).
- Rule system application varied widely: legal & regulatory was easier (some models >30–40%); mathematical formalisms were harder (<15% for many).
- Empirical discovery & simulation was toughest (around 11.8% on average). Inducing laws from data or operating in simulations strained models most.
Surprising findings:
- Newer isn't always better: GPT-5.2 underperformed GPT-5.1 by 5.6% overall, especially on experimental data and some management tasks. It struggled to maintain consistent reasoning across long contexts and sometimes ignored explicit constraints.
- More "thinking" helps... sometimes: Increasing reasoning effort raised scores a little for several models (e.g., +2–6%), but not universally. For GPT-5.2 and some others, gains were small or uneven.
- Length hurts: All models dropped as inputs grew longer. From 0–4K tokens to 32K+, many fell from near 25–35% to around 5–10%.
- Error anatomy: The biggest causes were "context ignored" and "context misused" (often >60% of failures combined). Format errors also stayed high (mid-30s to >40%), showing instruction-following gaps.
Hook for a key pattern: Think of two kinds of schoolwork: following a recipe vs. discovering a pattern from science lab data.
Concept 14: Inductive (data-driven) vs. Deductive (rule-driven) Reasoning
- What it is: Deduction applies explicit rules to cases; induction discovers the rules from examples/data.
- How it works:
- Deduction: Read rule → apply to situation → get answer.
- Induction: Study data → find pattern/law → use it to answer.
- Why it matters: Induction is harder for current models; CL-bench shows a bigger drop there. Anchor: Deduction is using a math formula given in class; induction is figuring out a new formula from many measurements. (A small side-by-side sketch follows below.)
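In miniature, with invented numbers: deduction applies a rule the context states outright, while induction must first recover the rule from observations before applying it.

```python
# Deduction vs. induction, in miniature (numbers invented for illustration).

# Deduction: the context states the rule, e.g. "fee = 2 + 0.5 * weight_kg".
def fee_deductive(weight_kg: float) -> float:
    return 2 + 0.5 * weight_kg

# Induction: only (weight, fee) observations are given; the rule must be
# recovered from the data before it can be applied to a new case.
observations = [(1.0, 2.5), (2.0, 3.0), (4.0, 4.0)]

def fit_linear(points):
    """Least-squares fit of fee = intercept + slope * weight."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return slope, mean_y - slope * mean_x   # (slope, intercept)

slope, intercept = fit_linear(observations)
print(fee_deductive(3.0), intercept + slope * 3.0)  # both come out to 3.5
```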
Big picture: Even our best models don't yet reliably read long material, learn what is new in it, and carry out complex tasks. The numbers expose a core skill gap: paying close attention to the provided material, truly understanding it, and executing with both reasoning and formatting discipline.
05 Discussion & Limitations
Limitations:
- Coverage: Even with 500 contexts and 18 subcategories, the real world is bigger. Ultra-specialized fields and odd edge cases aren't fully represented.
- Interaction length: Many tasks are single or short multi-turn. Real deployments often need long, evolving dialogues with corrections and memory.
- Modality: It's text-only for now. Real work often mixes diagrams, logs, audio, and video.
- Human baselines: We don't yet have controlled human scores from non-experts learning on the spot, which would help calibrate difficulty.
Required resources to use CL-bench:
- Access to long-context-capable LMs and a consistent reasoning setting.
- Enough context window (up to ~65K tokens) and compute budget for multiple runs.
- A reliable verifier setup (e.g., GPT-5.1-as-judge) and data handling for thousands of rubrics.
When not to use:
- If your system relies mainly on retrieval quality or tool orchestration, CL-bench won't isolate those. It's designed to test learning from already-provided context.
- If your tasks are trivial or don't require new knowledge, simpler benchmarks will suffice.
Open questions:
- Training: What data and curricula best teach models to privilege context over pretraining when they conflict?
- Architecture: Do we need new memory or multi-pass designs to digest long, dense contexts reliably?
- Feedback: Can we generate rubric-style feedback automatically to scale up training and evaluation?
- Stability: How do we make behavior more consistent across long contexts and inductive tasks?
Honest assessment: CL-bench squarely reveals that "read the doc, then do the job" is still hard for today's top models, especially discovering rules from data and staying faithful to every instruction and format constraint. That's exactly the kind of real-world skill we need next.
06 Conclusion & Future Work
Three-sentence summary: Real-world AI must learn from the information you hand it right now, not just from what it memorized years ago. CL-bench cleanly measures that skill by giving fresh, expert-crafted contexts and checking solutions with many fine-grained rubrics. Today's frontier models solve only about 17–24%, proving that context learning is the next big hill to climb.
Main achievement: CL-bench isolates and rigorously tests context learning (learning new rules, procedures, and laws directly from the provided material) across diverse, realistic scenarios with reliable automatic verification.
Future directions:
- Train with context-aware curricula and synthetic rubric feedback.
- Explore architectures with better long-term memory and multi-pass reading.
- Extend to multimodal contexts and longer interactive sessions.
- Add human baselines for clearer difficulty calibration.
Why remember this: As AI moves from trivia to tasks, its value depends on reading today's rules, learning them instantly, and executing safely. CL-bench shines a bright light on that exact skill and shows we're not there yet, guiding the path to smarter, more trustworthy assistants.
Practical Applications
- Evaluate vendor models before deployment by running them on CL-bench tasks aligned to your domain (e.g., safety procedures or compliance).
- Train internal models with a curriculum derived from CL-bench patterns (short contexts → longer, single-turn → multi-turn, deductive → inductive).
- Add rubric-style checks to your production prompts to reduce format errors and catch missing safety steps.
- Use CL-bench categories to diagnose weaknesses (e.g., if empirical discovery is low, add data-analysis fine-tuning or tool assistance).
- Design your context packs like CL-bench: make all required info present, explicit, and well-structured to improve success rates.
- Stress-test long-context performance by gradually increasing context size and tracking solve-rate drops.
- Adopt LM-as-a-judge internally to auto-verify complex outputs against task-specific rubrics before acting.
- Create contamination-free onboarding docs for new workflows to ensure models must learn from your latest procedures.
- Prioritize instruction-following hygiene: enforce strict output templates and include format rubrics to cut preventable failures.
- Track error types (ignored/misused context vs. formatting) to pick the right mitigation: prompt fixes, training data, or architecture changes (a small tallying sketch follows below).
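A small sketch of that tallying, with invented result records; use whatever failure labels your rubrics support.

```python
from collections import Counter

# Sketch of error-type tracking over evaluation results. The records and
# label names are invented for illustration.
results = [
    {"task": "drone-01", "solved": False, "error": "context_ignored"},
    {"task": "law-07",   "solved": False, "error": "format_violation"},
    {"task": "phys-03",  "solved": True,  "error": None},
    {"task": "drone-02", "solved": False, "error": "context_misused"},
]

# Count how often each failure type appears among unsolved tasks.
failures = Counter(r["error"] for r in results if not r["solved"])
for error_type, count in failures.most_common():
    print(f"{error_type}: {count}")
```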