CL-bench: A Benchmark for Context Learning
Key Summary
- CL-bench is a new test that checks whether AI can truly learn new things from the information you give it right now, not just from what it memorized before.
- It includes 500 real-world style contexts, 1,899 tasks, and 31,607 checkable rubrics written by experts.
- Every task is solvable using only the provided context, so no web search is needed and pretraining knowledge can't save the day.
- Top frontier models only solved about 17.2% on average, and the best model reached 23.7%, showing big room for improvement.
- The benchmark covers four types of learning: domain knowledge reasoning, rule system application, procedural task execution, and empirical discovery and simulation.
- Models struggled most with discovering rules from data or working in simulations (about 11.8% success), which is harder than just applying written rules.
- Most failures came from ignoring or misusing the given context, and many answers broke simple format rules even when the knowledge was right.
- Longer inputs made things much harder: performance fell as contexts grew from a few thousand tokens to 32K+.
- Extra "thinking" helped a bit for some models, but not always; newer versions weren't automatically better at context learning.
- CL-bench provides a clean, realistic way to measure and improve the core skill of learning from context, which is essential for real-world AI assistants.
Why This Research Matters
In the real world, we constantly face new rules, tools, and data, and we succeed by learning from what's in front of us. CL-bench measures whether AI can do the same: read fresh material, understand it, and use it correctly. This is key for safe operations like following updated safety procedures, complying with changing laws, or fixing new devices. If AI can't learn from context, it guesses from old memories, which leads to mistakes, risks, and wasted time. By revealing where models fail (ignoring details, misusing rules, or breaking formats), CL-bench shows exactly what to improve. As models get better on CL-bench, they become more useful, trustworthy assistants that can adapt to today's instructions, not yesterday's internet.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're joining a new after-school club. They hand you a handbook with brand-new rules no one taught you before. If you can read that handbook, learn the rules, and then play correctly today, you're doing context learning.
Concept 1: Prompt Engineering
- What it is: It's the art of asking AI clear questions so it knows exactly what you want.
- How it works:
- Write a clear instruction (what to do).
- Add any needed details (style, steps, length).
- Check that the answer matches your instruction.
- Why it matters: Without good prompts, AI guesses and often wanders off-topic. Anchor: Like telling a friend, "Please bring a blue notebook and a pencil to math class today," instead of just saying "Bring stuff."
Concept 2: In-Context Learning (ICL)
- What it is: A model learns a task pattern from a few examples in the prompt (like showing it a couple of "question → answer" pairs).
- How it works:
- Provide 2–5 examples of how to do the task.
- Ask a new question in the same format.
- The model imitates the pattern it saw.
- Why it matters: Without ICL, the model may not know the format or style you expect. Anchor: Show a friend two solved math problems, then give a third one in the same style; your friend copies the pattern (a minimal prompt sketch follows below).
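To make the pattern concrete, here is a minimal sketch of assembling a few-shot prompt; the example pairs are invented for illustration and are not drawn from CL-bench.

```python
# Minimal sketch of an in-context-learning (few-shot) prompt.
# The example pairs below are made up for illustration.
examples = [
    ("Translate to pig latin: hello", "ellohay"),
    ("Translate to pig latin: world", "orldway"),
]

def build_few_shot_prompt(new_question: str) -> str:
    """Show a few solved examples, then ask a new question in the same format."""
    lines = []
    for question, answer in examples:
        lines.append(f"Q: {question}\nA: {answer}")
    # The model is expected to imitate the pattern it just saw.
    lines.append(f"Q: {new_question}\nA:")
    return "\n\n".join(lines)

print(build_few_shot_prompt("Translate to pig latin: bench"))
```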
Concept 3: Long-Context Reasoning
- What it is: Keeping track of lots of information across a long story and using the right parts at the right time.
- How it works:
- Read and store key facts from far-apart places.
- Find what's relevant to the current question.
- Connect the dots without losing track.
- Why it matters: If you forget page 2 while on page 50, answers go wrong. Anchor: Remembering details from early movie scenes to solve a twist at the end.
Concept 4: Context Learning (the star of this paper)
- What it is: The model learns brand-new knowledge given right now in the context and applies it to solve tasks.
- How it works:
- Read new information (rules, manuals, data, laws) in the prompt.
- Extract the specifics that matter.
- Use them to reason and answer, even if they conflict with what it memorized before.
- Why it matters: Real life changes fast: new products, new procedures, new laws. If AI can't learn from the latest context, it gives outdated or wrong answers. Anchor: A robot reading a fresh appliance manual and then fixing your new dishwasher correctly, even though it never saw that model before.
The world before: Most evaluations praised AIs for using pre-trained knowledge plus good prompts (prompt engineering) and simple few-shot examples (ICL). That works when tasks match what the model already "knows." But in the real world, problems often require brand-new, task-specific information: a new programming language, an updated safety checklist, or experimental data that hides a law you must discover.
The problem: Benchmarks rarely test whether models actually learn new knowledge from long, messy, realistic contexts. Many benchmarks mix two difficulties, finding the right context (retrieval) and using it (learning), so we can't tell if failure comes from bad search or poor learning.
Failed attempts: Long-context tests often check reading or retrieval, not deep learning and application. Instruction-following tests check rule obedience, not domain learning. Tool-using tests confound retrieval problems with learning problems.
The gap: We needed a clean test where all the necessary new knowledge is already in the context (no web lookup), crafted by experts to be new or niche, and checked by detailed rubrics. Then we can ask: can the model truly learn from what it just read?
Real stakes: Think troubleshooting a factory robot from a new manual, following a city's latest building code, or discovering a physical law from sensor logs. If AI can't learn from context, it will keep guessing from old memories, which is unsafe and unhelpful. That's why CL-bench matters: it spotlights the missing skill and measures it directly.
02 Core Idea
Hook: You know how a great substitute teacher walks into a classroom, quickly reads the day's plan, and then teaches smoothly, even though it's all new? That's the superpower this paper tests in AI.
Aha! Moment in one sentence: If we want AI to work in the real world, we must measure and improve its ability to learn new knowledge from the context we give it right now, so we built CL-bench to test exactly that.
Concept 5: Benchmark (CL-bench)
- What it is: A big, careful test set that measures how well AIs learn from context.
- How it works:
- Experts craft realistic contexts with new or niche knowledge.
- Tasks require using that context (no outside retrieval needed).
- Detailed rubrics check if answers truly applied the context.
- Why it matters: Without a precise test, we can't tell if models are really learning or just guessing. Anchor: Like a science fair rubric that checks every part of your experiment step by step.
Multiple analogies for the idea:
- Detective analogy: The model must read today's case file (context) and solve the mystery, not rely on old cases in its head.
- Sports playbook analogy: The coach hands a brand-new playbook mid-game; the team must read it and run the plays correctly right away.
- Travel analogy: You land in a new country, read the local rules, and drive safely: no prior knowledge needed, just careful reading and applying.
Before vs. after:
- Before: We mostly tested how well models used pre-trained knowledge with neat prompts and a few examples.
- After: We test whether models can pick up brand-new rules, procedures, or patterns from the context and apply them correctly, which is much closer to real life.
Why it works (intuition, not equations):
- By making the needed knowledge live only inside the given context, we force the model to pay attention to that content rather than its memory.
- By using many small, binary rubrics, we can check the result from multiple angles (facts, calculations, steps, and format).
- By covering four different context types (rules, procedures, domains, and data-driven discovery), we test different "learning muscles," not just reading.
Building blocks of CL-bench:
- Fresh, realistic contexts: product manuals, legal systems (even fictional), technical standards, datasets, simulations.
- Tasks that demand learning: multi-step problems, rule application, procedure execution, or law discoveryâalways solvable from the given text/data.
- Rubric-based verification: many yes/no checks ensure completeness and correctness.
- Difficulty levers: long contexts, multi-turn dependencies, and conflicts with pre-trained knowledge to ensure true learning from the provided material.
- Clean scope: no tool calls or retrieval, so failures reflect learning issues, not search misses.
Concept 6: Task-Level Rubrics
- What it is: A checklist of tiny yes/no questions to judge the answer.
- How it works:
- Experts write multiple binary checks per task.
- A verifying model compares the solution to the rubrics.
- Passing all rubrics = solved; any miss = not solved.
- Why it matters: Without fine-grained checks, we'd miss subtle mistakes or partial solutions that skip key rules. Anchor: Like a baking contest card: "Used the right flour? Yes/No. Baked 30 minutes? Yes/No. Frosting matches recipe? Yes/No." (A minimal scoring sketch follows below.)
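A minimal sketch of this all-or-nothing scoring, assuming the rubric verdicts are already available as yes/no values; the rubric texts below are invented for illustration.

```python
# Sketch of task-level rubric scoring: a task counts as solved only if
# every binary rubric check passes. The rubric texts are illustrative.
rubric_results = {
    "Used only functions documented in the provided SDK": True,
    "Performed the required safety check before launch": True,
    "Output follows the documented pseudocode format": False,  # one miss...
}

def is_solved(results: dict[str, bool]) -> bool:
    """All rubrics must be satisfied; any single 'No' means the task is not solved."""
    return all(results.values())

print(is_solved(rubric_results))  # False: the format rubric failed
```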
Concept 7: Domain Knowledge Reasoning
- What it is: Learning expert knowledge from a provided text and using it to reason (e.g., finance, healthcare, legal advice).
- How it works:
- Read the expert material in context.
- Identify relevant facts and principles.
- Apply them carefully to the case.
- Why it matters: Without it, models give generic or unsafe advice. Anchor: Reading a hospital's new dosage guideline and then recommending the right, safe dose for a specific patient case.
Concept 8: Rule System Application
- What it is: Learning a brand-new formal rule set (game mechanics, math system, programming syntax, or regulations) and applying it.
- How it works:
- Extract exact rules and constraints.
- Translate the task into rule steps.
- Check for compliance before finalizing.
- Why it matters: Without this, models break rules or mix in old habits. Anchor: Learning a custom board game from its booklet and then making only legal moves.
Concept 9: Procedural Task Execution
- What it is: Following step-by-step procedures from manuals or workflows to complete operations.
- How it works:
- Find the right procedure in the document.
- Execute steps in order with correct parameters.
- Respect safety and formatting rules.
- Why it matters: Skipping one step can cause errors or hazards. Anchor: Turning a drone delivery request into exact, documented pseudocode with the right safety checks (a small procedure-runner sketch follows below).
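Here is a toy sketch of what "execute steps in order with correct parameters" can look like; the step names and limits are invented for illustration and are not the benchmark's drone SDK.

```python
# Toy sketch of procedural task execution: follow documented steps in order,
# validating parameters before acting. Step names and limits are illustrative.
PROCEDURE = [
    ("check_payload", {"max_weight_kg": 5}),
    ("request_airspace_clearance", {}),
    ("plan_route", {"avoid_no_fly_zones": True}),
    ("launch", {}),
]

def run_procedure(request: dict) -> list[str]:
    if request.get("weight_kg", 0) > PROCEDURE[0][1]["max_weight_kg"]:
        raise ValueError("payload exceeds documented limit")
    log = []
    for step_name, params in PROCEDURE:
        # A real workflow would call the documented function here;
        # this sketch only records that each step ran, in the documented order.
        log.append(f"{step_name}({params})")
    return log

print(run_procedure({"weight_kg": 3.2}))
```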
Concept 10: Empirical Discovery & Simulation
- What it is: Finding patterns or laws from data, or reasoning inside a simulated environment with rules that emerge from evidence.
- How it works:
- Analyze data or environment logs.
- Induce patterns or test hypotheses.
- Apply the discovered law to answer or act.
- Why it matters: Without it, AI can't learn from experiments or adapt to dynamic worlds. Anchor: Noticing helical motion and measuring a pitch angle from physics data to infer a particle's behavior.
Bottom line: CL-bench makes context learning visible and measurable, revealing which models can "read, learn, and do" and which still guess from memory.
03 Methodology
At a high level: System prompt + Context + Task → Model reads and learns new knowledge → Model produces a solution → A verifier checks many rubrics → Pass all = solved.
Step-by-step recipe:
- Expert-built contexts with fresh knowledge
- What happens: Domain experts create 500 realistic contexts (manuals, laws, data logs, simulations, technical standards). The knowledge is new, niche, or intentionally modified so pretraining canât give the answer.
- Why this exists: If knowledge were common on the internet, models might "remember" it instead of learning from the given text.
- Example data: A fictional country's legal system with unique precedents, or a new drone SDK with exact function names and error codes.
- Tasks that force learning from the provided context
- What happens: Experts design 1,899 tasks that rely only on the accompanying contextâno web search, no tools.
- Why this exists: To isolate the learning skill; we want failures to reflect a failure to learn, not missing documents.
- Example: "Convert this urgent hazmat delivery request into valid pseudocode using only documented safety functions."
- Multi-turn sequences and dependencies
- What happens: About half the tasks appear in sequences; later tasks depend on earlier correct answers.
- Why this exists: Real work unfolds over time; we test whether models can keep track and build on what they learned.
- Example: A series of legal decisions where each ruling uses principles established earlier.
- Long, realistic inputs
- What happens: Contexts can be tens of thousands of tokens long, mixing rules, references, and details.
- Why this exists: Real manuals and data logs are long; strong models must keep track across distance.
- Example: API docs spanning navigation, payload, and safety sections for drones.
- Task-level rubrics for rigorous, automatic checking
- What happens: For each task, experts author multiple yes/no checks covering correctness, calculation, judgment, procedure, completeness, and format.
- Why this exists: Complex tasks need multi-angle checking so partial answers don't sneak by.
- Example rubric: "Did the solution call Safety_request_airspace() before launch?"
- LM-as-a-judge verification
- What happens: A strong model (GPT-5.1) acts as the verifier, comparing the solution to rubrics and providing rationales.
- Why this exists: Many answers are open-ended; binary rules alone aren't enough. A capable judge plus detailed rubrics gives reliable scoring.
- Example: Human spot-checks showed over 90% agreement; cross-verifier checks (Claude, Qwen) also showed >90% raw agreement.
- Strict success criterion
- What happens: A task is "solved" only if all rubrics pass.
- Why this exists: Real operations break if any crucial safety or format step is missed.
- Example: A drone plan that forgets airspace permission is unsafe, even if the route is fine.
Concept 11: Contamination-Free Design
- What it is: Ensuring the needed knowledge wasnât learned during pretraining.
- How it works:
- Use fictional creation (new laws, new languages).
- Modify existing content (alter definitions or events).
- Include niche/emerging materials (long-tail docs).
- Why it matters: If the model already "knows" the answer, we can't measure learning from the given context. Anchor: A brand-new board game you've never seen, so you must read the booklet now to play.
Concept 12: Sequential Multi-Turn Tasks
- What it is: Sets of tasks where later ones depend on earlier answers.
- How it works:
- Present Task 1 and solve it using the context.
- Present Task 2 that builds on Task 1.
- Continue the chain, testing memory and consistency.
- Why it matters: Real jobs are chains, not single trivia questions. Anchor: Cooking a meal course-by-course; if you undercook the main dish, dessert timing also goes off. (A minimal multi-turn sketch follows below.)
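A minimal sketch of how such a chain can be run, assuming a hypothetical ask_model wrapper around whatever chat API you use; the point is that each answer is carried into the next turn.

```python
# Sketch of a sequential multi-turn evaluation loop.
# `ask_model` is a hypothetical stand-in for your model client.
def ask_model(messages: list[dict]) -> str:
    raise NotImplementedError("plug in your model client here")

def run_task_sequence(context: str, tasks: list[str]) -> list[str]:
    messages = [{"role": "system", "content": context}]
    answers = []
    for task in tasks:
        messages.append({"role": "user", "content": task})
        answer = ask_model(messages)          # later tasks see earlier answers
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```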
Concept 13: LM-as-a-Judge
- What it is: Using a strong model to check if each rubric is satisfied.
- How it works:
- Feed the solution and rubrics to the verifier.
- The verifier answers each rubric: Yes/No + rationale.
- Aggregate results; all-Yes → solved.
- Why it matters: Some answers need nuanced judgment that simple scripts can't capture. Anchor: A referee reading the rulebook while judging a tricky play in sports. (A hedged judging-loop sketch follows below.)
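A hedged sketch of the judging loop, assuming a hypothetical judge wrapper around the verifier model; the prompt wording is illustrative, not the paper's exact template.

```python
# Sketch of LM-as-a-judge verification: ask the verifier one yes/no question
# per rubric and require all "yes" answers.
# `judge` is a hypothetical wrapper around your verifier model's API.
def judge(prompt: str) -> str:
    raise NotImplementedError("call your verifier model here")

def verify(solution: str, rubrics: list[str]) -> bool:
    for rubric in rubrics:
        prompt = (
            "Solution:\n" + solution
            + "\n\nRubric: " + rubric
            + "\nAnswer 'yes' or 'no', then give a one-sentence rationale."
        )
        verdict = judge(prompt)
        if not verdict.strip().lower().startswith("yes"):
            return False  # a single failed rubric means the task is not solved
    return True
```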
The secret sauce:
- Clean separation: Because all knowledge sits inside the context, this benchmark spotlights context learning (not retrieval).
- Many tiny checks: Rubrics catch subtle misses (like using the wrong function or skipping one safety step).
- Diverse context types: Different skills (deductive rule use vs. inductive discovery) reveal which âlearning musclesâ are weak.
- Realistic difficulty: Long inputs, multi-step dependencies, and formatting constraints feel like real work.
Concrete examples in action:
- Drone SDK case: The model must refuse a non-existent function, follow safety modules, include task parameters (e.g., drone ID D-998, Sector 4), and output in the documented pseudocode format.
- Physics data case: From position/time logs of a charged particle, infer helical motion and compute the pitch angle from the ratio of the parallel and perpendicular velocity components (about 27° in the paper's example), as stated in the context (a worked sketch follows below).
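A rough worked sketch of that physics case, with invented samples rather than the benchmark's data; the exact pitch-angle convention is whatever the provided context defines.

```python
import math

# Toy sketch for the physics-style task: estimate velocity components of a
# helical trajectory from (t, x, y, z) logs, then form a pitch angle from the
# parallel/perpendicular split. The samples and the angle convention here are
# illustrative; in CL-bench the provided context defines the exact formula.
logs = [  # (t, x, y, z) samples of a helix around the z-axis (made up)
    (0.0, 1.00, 0.00, 0.00),
    (0.1, 0.95, 0.31, 0.05),
    (0.2, 0.81, 0.59, 0.10),
]

def velocity(a, b):
    """Finite-difference velocity between two consecutive samples."""
    dt = b[0] - a[0]
    return tuple((b[i] - a[i]) / dt for i in range(1, 4))

vx, vy, vz = velocity(logs[0], logs[1])
v_perp = math.hypot(vx, vy)   # speed in the plane of the circular motion
v_par = abs(vz)               # speed along the helix axis
pitch_deg = math.degrees(math.atan2(v_par, v_perp))
print(f"estimated pitch angle ~ {pitch_deg:.1f} degrees")
```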
Together, these steps turn "learn from what you just read" into a clear, fair, and hard test.
04 Experiments & Results
The test: We measured "task solving rate": does the model pass all rubrics for a task? We looked across four big categories: domain knowledge reasoning, rule system application, procedural task execution, and empirical discovery & simulation.
The competition: Ten frontier models using their official APIs and reasoning settings. The lineup included GPT-5.1, GPT-5.2, o3, Claude Opus 4.5, Kimi K2, HY 2.0, Gemini 3 Pro, Qwen 3 Max, DeepSeek V3.2, and Doubao 1.6.
The scoreboard (with context):
- Overall average: 17.2% solve rate. That's like most students getting a D on a brand-new unit they had to learn from a handout.
- Best model: GPT-5.1 at 23.7% (around a solid C), still far from mastery.
- Category difficulty:
- Domain knowledge reasoning and procedural tasks were relatively more approachable (top models around low- to mid-20s%).
- Rule system application varied widely: legal & regulatory was easier (some models >30–40%); mathematical formalisms were harder (<15% for many).
- Empirical discovery & simulation was toughest (around 11.8% on average). Inducing laws from data or operating in simulations strained models most.
Surprising findings:
- Newer isn't always better: GPT-5.2 underperformed GPT-5.1 by 5.6% overall, especially on experimental data and some management tasks. It struggled to maintain consistent reasoning across long contexts and sometimes ignored explicit constraints.
- More "thinking" helps... sometimes: Increasing reasoning effort raised scores a little for several models (e.g., +2–6%), but not universally. For GPT-5.2 and some others, gains were small or uneven.
- Length hurts: All models dropped as inputs grew longer. From 0–4K tokens to 32K+, many fell from near 25–35% to around 5–10%.
- Error anatomy: The biggest causes were "context ignored" and "context misused" (often >60% of failures combined). Format errors also stayed high (mid-30s to >40%), showing instruction-following gaps.
Hook for a key pattern: Think of two kinds of schoolwork: following a recipe vs. discovering a pattern from science lab data.
Concept 14: Inductive (data-driven) vs. Deductive (rule-driven) Reasoning
- What it is: Deduction applies explicit rules to cases; induction discovers the rules from examples/data.
- How it works:
- Deduction: Read rule → apply to situation → get answer.
- Induction: Study data → find pattern/law → use it to answer.
- Why it matters: Induction is harder for current models; CL-bench shows a bigger drop there. Anchor: Deduction is using a math formula given in class; induction is figuring out a new formula from many measurements. (A small side-by-side sketch follows below.)
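In miniature, with invented numbers: deduction applies a rule the context states outright, while induction must first recover the rule from observations before applying it.

```python
# Deduction vs. induction, in miniature (numbers invented for illustration).

# Deduction: the context states the rule, e.g. "fee = 2 + 0.5 * weight_kg".
def fee_deductive(weight_kg: float) -> float:
    return 2 + 0.5 * weight_kg

# Induction: only (weight, fee) observations are given; the rule must be
# recovered from the data before it can be applied to a new case.
observations = [(1.0, 2.5), (2.0, 3.0), (4.0, 4.0)]

def fit_linear(points):
    """Least-squares fit of fee = intercept + slope * weight."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return slope, mean_y - slope * mean_x   # (slope, intercept)

slope, intercept = fit_linear(observations)
print(fee_deductive(3.0), intercept + slope * 3.0)  # both come out to 3.5
```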
Big picture: Even our best models don't yet reliably read long material, learn what is new in it, and carry out complex tasks. The numbers expose a core skill gap: paying close attention to the provided material, truly understanding it, and executing with both reasoning and formatting discipline.
05 Discussion & Limitations
Limitations:
- Coverage: Even with 500 contexts and 18 subcategories, the real world is bigger. Ultra-specialized fields and odd edge cases aren't fully represented.
- Interaction length: Many tasks are single or short multi-turn. Real deployments often need long, evolving dialogues with corrections and memory.
- Modality: It's text-only for now. Real work often mixes diagrams, logs, audio, and video.
- Human baselines: We don't yet have controlled human scores from non-experts learning on the spot, which would help calibrate difficulty.
Required resources to use CL-bench:
- Access to long-context-capable LMs and a consistent reasoning setting.
- Enough context window (up to ~65K tokens) and compute budget for multiple runs.
- A reliable verifier setup (e.g., GPT-5.1-as-judge) and data handling for thousands of rubrics.
When not to use:
- If your system relies mainly on retrieval quality or tool orchestration, CL-bench won't isolate those. It's designed to test learning from already-provided context.
- If your tasks are trivial or don't require new knowledge, simpler benchmarks will suffice.
Open questions:
- Training: What data and curricula best teach models to privilege context over pretraining when they conflict?
- Architecture: Do we need new memory or multi-pass designs to digest long, dense contexts reliably?
- Feedback: Can we generate rubric-style feedback automatically to scale up training and evaluation?
- Stability: How do we make behavior more consistent across long contexts and inductive tasks?
Honest assessment: CL-bench squarely reveals that "read the doc, then do the job" is still hard for today's top models, especially discovering rules from data and staying faithful to every instruction and format constraint. That's exactly the kind of real-world skill we need next.
06 Conclusion & Future Work
Three-sentence summary: Real-world AI must learn from the information you hand it right now, not just from what it memorized years ago. CL-bench cleanly measures that skill by giving fresh, expert-crafted contexts and checking solutions with many fine-grained rubrics. Today's frontier models solve only about 17–24%, proving that context learning is the next big hill to climb.
Main achievement: CL-bench isolates and rigorously tests context learning (learning new rules, procedures, and laws directly from the provided material) across diverse, realistic scenarios with reliable automatic verification.
Future directions:
- Train with context-aware curricula and synthetic rubric feedback.
- Explore architectures with better long-term memory and multi-pass reading.
- Extend to multimodal contexts and longer interactive sessions.
- Add human baselines for clearer difficulty calibration.
Why remember this: As AI moves from trivia to tasks, its value depends on reading today's rules, learning them instantly, and executing safely. CL-bench shines a bright light on that exact skill and shows we're not there yet, guiding the path to smarter, more trustworthy assistants.
Practical Applications
- Evaluate vendor models before deployment by running them on CL-bench tasks aligned to your domain (e.g., safety procedures or compliance).
- Train internal models with a curriculum derived from CL-bench patterns (short contexts → longer, single-turn → multi-turn, deductive → inductive).
- Add rubric-style checks to your production prompts to reduce format errors and catch missing safety steps.
- Use CL-bench categories to diagnose weaknesses (e.g., if empirical discovery is low, add data-analysis fine-tuning or tool assistance).
- Design your context packs like CL-bench: make all required info present, explicit, and well-structured to improve success rates.
- Stress-test long-context performance by gradually increasing context size and tracking solve-rate drops.
- Adopt LM-as-a-judge internally to auto-verify complex outputs against task-specific rubrics before acting.
- Create contamination-free onboarding docs for new workflows to ensure models must learn from your latest procedures.
- Prioritize instruction-following hygiene: enforce strict output templates and include format rubrics to cut preventable failures.
- Track error types (ignored/misused context vs. formatting) to pick the right mitigation: prompt fixes, training data, or architecture changes (a small tallying sketch follows below).
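A small sketch of that tallying, with invented result records; use whatever failure labels your rubrics support.

```python
from collections import Counter

# Sketch of error-type tracking over evaluation results. The records and
# label names are invented for illustration.
results = [
    {"task": "drone-01", "solved": False, "error": "context_ignored"},
    {"task": "law-07",   "solved": False, "error": "format_violation"},
    {"task": "phys-03",  "solved": True,  "error": None},
    {"task": "drone-02", "solved": False, "error": "context_misused"},
]

# Count how often each failure type appears among unsolved tasks.
failures = Counter(r["error"] for r in results if not r["solved"])
for error_type, count in failures.most_common():
    print(f"{error_type}: {count}")
```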