AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios
Key Summary
- This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.
- It covers three kinds of real tasks: clearly written step-by-step workflows, hidden rules that must be inferred from attachments, and careful edits to improve previous work.
- Tasks use real files like PDFs, PPTs, spreadsheets, and images, and the AI must produce correct, file-based results, not just chatty answers.
- The scoring uses tiny, yes-or-no checklist items (bonus and penalty) so judging is fair and consistent across different tasks and file types.
- An automated judge (a strong multimodal model) checks results, uses web search when facts change, and matches human judgment about 80% of the time.
- There are 104 tasks with 767 scoring points, mostly about work tasks but also life and study, with many different file formats to prevent overfitting.
- In tests, top agents were very close: Manus (0.645), Genspark (0.635), and ChatGPT-Agent (0.626), while Minimax-Agent scored 0.562.
- Results suggest many leading language models already have built-in agent skills, so future wins will come from better product design and user-centered refinement.
- The hardest area for agents is inferring hidden rules from attachments, like copying a slide template’s style without being told explicitly.
- The benchmark can also generate new tasks using a file-centered pipeline, helping both evaluation and future training data for reinforcement learning.
Why This Research Matters
Real people need AI that does actual work, not just chats. This benchmark checks whether agents can read your files, follow your exact steps, and deliver correct, usable results. It pushes AI to handle the messy mix of tasks we face in work, life, and study—verifying facts, copying styles from examples, and making careful edits over time. Because the scoring is clear and mostly automated, progress can be measured fairly and frequently. The findings also show that many core agent skills are already in modern models, so product design and reliability will decide who truly helps users. Finally, the dataset and pipeline can train future agents to be more trustworthy, grounded, and useful day to day.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how you ask a friend to help with homework, pack a bag, or fix a PowerPoint, and you expect them to follow your steps, use your files, and adjust when you change your mind? That’s exactly how people want AI helpers to work in real life.
🥬 Filling (The Actual Concept)
- What it is: This paper is about a new way to test AI helpers (agents) on the kinds of mixed, everyday tasks regular people actually do—using instructions, files, and follow-up edits.
- How it works: The benchmark gives the agent a task and attachments (like PDFs, slides, or spreadsheets), tells it what to deliver, and then judges the result with a clear checklist so we can see if the agent truly followed instructions.
- Why it matters: Without this kind of test, we might think an AI is great because it solves hard puzzles, but it could still fail at everyday jobs like “follow this 5-step plan and update my document exactly.”
🍞 Bottom Bread (Anchor) Imagine asking an AI: “Plan my trip using the conference website, double-check the location, then give me a cheap and a fast option.” This benchmark checks if the AI really does all of that, in order, and with proof.
— New Concept 1 — 🍞 Top Bread (Hook) Imagine learning a new language so well that you can read stories, follow directions, and write your own answers.
🥬 Filling (The Actual Concept)
- What it is: Large Language Models (LLMs) are very big computer programs trained to understand and generate human language.
- How it works: 1) Read your words, 2) connect them to patterns learned from tons of text, 3) predict the most likely helpful next words, 4) produce answers or plans.
- Why it matters: Most agent assistants today are powered by LLMs, so their language skills control how well they understand tasks.
🍞 Bottom Bread (Anchor) When you type “Compare two phone plans and show the cheapest total cost,” the LLM figures out what that means and how to explain the result.
— New Concept 2 — 🍞 Top Bread (Hook) You know how you can read a book, watch a picture, and listen to music? Your brain understands many kinds of information.
🥬 Filling (The Actual Concept)
- What it is: Multi-modal Input/Output means the AI can work with different types of data—text, images, slides, spreadsheets—and also produce them.
- How it works: 1) The agent opens files, 2) extracts text or visuals, 3) reasons across them, 4) outputs the right kind of file (like a .pptx or .xlsx) or text summary.
- Why it matters: Real tasks use files, not just chat text. Without multimodal skills, the AI can’t finish real deliverables.
🍞 Bottom Bread (Anchor) If you give the AI a floor plan (SVG) and a constraints sheet (Excel), multi-modal ability lets it read both and update the layout correctly.
— New Concept 3 — 🍞 Top Bread (Hook) Think of reading instructions carefully—like a recipe—so you don’t forget a step or add ingredients you weren’t told to use.
🥬 Filling (The Actual Concept)
- What it is: Natural Language Processing (NLP) is how computers understand and use human language.
- How it works: 1) Parse your sentence, 2) map words to meanings, 3) track the steps and constraints, 4) generate the right response.
- Why it matters: If NLP is weak, the AI might miss steps, add fake facts, or give the wrong file format.
🍞 Bottom Bread (Anchor) “Check the official site first, then cross-check elsewhere” is plain English. Good NLP helps an AI keep that exact order.
The world before this benchmark looked shiny in demos—agents could code, do research, and pass complicated tests in specific areas. But everyday users didn’t always feel the benefits. Why? Because most benchmarks chased “harder and harder” puzzles, not the broad variety of daily tasks people truly need. Many tests also didn’t force agents to handle attachments or deliver finished files. So agents could talk well, but stumble when asked to build a real PowerPoint, edit a spreadsheet, or update a design exactly.
Researchers tried pieces of the problem. Some benchmarks checked if models followed formats exactly. Others tested web browsing, tool use, or domain-specific skills (like websites or coding). A few examined real-world economic value. These all helped, but none covered a day’s mixed workload with files, step-by-step workflows, hidden rules from attachments, and follow-up edits—together.
The missing piece was a benchmark that: 1) accepts natural language instructions, 2) requires working with multiple file types, 3) demands faithful step-by-step execution, 4) tests hidden rule-following from attachments, and 5) allows iterative refinement like real teamwork. Plus, it needed clear, fair scoring that matches human judgment.
AgentIF-OneDay fills that gap. It collects tasks from work, life, and study; includes attachments; and requires concrete outputs. It also scores with tiny, objective, yes/no checklist items, and uses a strong multimodal judge model aligned with human graders. The result is a test that looks and feels like a typical person’s busy day—where success means finishing the job correctly, not just sounding smart.
Why should anyone care? Because this is how we’ll know if AI can truly help with school projects, office reports, trip plans, research summaries, or design updates—reliably, safely, and on time. If your AI can take your files, follow your exact rules, and improve its work across several turns, then it’s a tool you can trust for real life, not just a cool demo.
02 Core Idea
🍞 Top Bread (Hook) Imagine giving a helper a full day’s to-do list—check the website, verify facts, read my PDFs, make slides in my style, and fix things when I give notes—and getting back correct files that match exactly what you asked.
🥬 Filling (The Actual Concept)
- What it is: The key idea is a task-level, file-centered benchmark—AgentIF-OneDay—that checks if AI agents can follow real instructions across a day’s worth of jobs, using attachments and producing concrete deliverables.
- How it works: 1) Provide natural language tasks and attachments, 2) require exact outputs (files, formats, content), 3) judge with instance-level rubrics (bonus/penalty, yes/no), 4) use a strong multimodal LLM-as-judge with web search for changing facts.
- Why it matters: It measures what people actually need—accurate, grounded, step-following work—rather than only testing abstract reasoning.
🍞 Bottom Bread (Anchor) A travel plan task requires checking the official conference page first, cross-checking elsewhere, then building two itineraries. The agent must show it did each step correctly and deliver the final plans.
— New Concept 4 — 🍞 Top Bread (Hook) You know how a good school day mixes different classes—math, art, P.E.—not just harder and harder math?
🥬 Filling (The Actual Concept)
- What it is: AgentIF-OneDay is a benchmark that mixes many real-life tasks and files to see if AI can follow instructions in daily scenarios.
- How it works: 1) Curate tasks from work, life, and study, 2) include attachments (PDFs, PPTs, spreadsheets, images), 3) require specific deliverables, 4) score with a clear checklist.
- Why it matters: Without diversity, a model might ace one tricky domain but fail typical office or home tasks.
🍞 Bottom Bread (Anchor) Across 104 tasks and 767 scoring points, the test covers trip planning, phone-plan math, research slides, layout edits, and more.
— New Concept 5 — 🍞 Top Bread (Hook) Imagine following a cooking recipe exactly in the right order—even if it’s long.
🥬 Filling (The Actual Concept)
- What it is: Open Workflow Execution is when the agent follows a detailed, explicit set of steps (a workflow) carefully and completely.
- How it works: 1) Read the step-by-step plan, 2) keep all steps in memory, 3) do them in order, 4) avoid making things up, 5) produce proof-based results.
- Why it matters: If the agent forgets steps or invents facts, the final output can’t be trusted.
🍞 Bottom Bread (Anchor) “Verify conference location on the official site, cross-check elsewhere, then plan flights” is a classic Open Workflow Execution task.
— New Concept 6 — 🍞 Top Bread (Hook) Think of copying a poster’s style by looking at a sample: same fonts, same layout rules—without anyone telling you the exact recipe.
🥬 Filling (The Actual Concept)
- What it is: Latent Instruction Inference means the agent discovers hidden rules from attachments and applies them correctly to new outputs.
- How it works: 1) Read the attachment, 2) notice patterns (format, pricing logic, citation style), 3) infer the unstated rules, 4) use them in a new task.
- Why it matters: People often share examples instead of full instructions; agents must learn from context.
🍞 Bottom Bread (Anchor) Given a phone-plan PDF, the agent must combine base price, trade-in value, and plan fees to find the cheapest total cost.
— New Concept 7 — 🍞 Top Bread (Hook) Imagine a teacher asking you to fix a report they already marked up—don’t restart, just improve what’s there.
🥬 Filling (The Actual Concept)
- What it is: Iterative Refinement is improving an existing output over multiple turns while keeping the current state consistent.
- How it works: 1) Load the current work, 2) apply precise changes, 3) don’t break what works, 4) repeat as the user adds new constraints.
- Why it matters: Real teamwork is multi-turn; agents must edit carefully, not start from scratch each time.
🍞 Bottom Bread (Anchor) Update a venue SVG using rules in an Excel file so all constraints are satisfied without ruining readability or walkability.
— New Concept 8 — 🍞 Top Bread (Hook) Picture a teacher’s checklist: each box is either checked or not—simple and fair.
🥬 Filling (The Actual Concept)
- What it is: Instance-level Rubrics are tiny, objective checks (bonus or penalty) that judge if each requirement was met.
- How it works: 1) Break the task into verifiable points, 2) mark each as satisfied or not, 3) add bonuses, subtract penalties, 4) normalize to get the final score.
- Why it matters: Clear rubrics prevent fuzzy grading and make results comparable.
🍞 Bottom Bread (Anchor) “Did you verify the venue on the official site?” earns +1 if yes; “Is the required citation marker missing?” costs -1 if it is.
Three analogies for the whole idea:
- Field trip checklist: Bring permission slip, lunch, and water; do safety check; then go. The agent must pass each box.
- Lego manual: Follow numbered steps, use the right pieces (attachments), and snap them in order to build the set.
- Cooking show: Prep all ingredients (files), follow the recipe (workflow), plate it nicely (format), and taste-test (rubric) at the end.
Before vs. After: Before, agents often looked smart in narrow tests. After, we can see if they actually deliver the right files, follow all steps, respect hidden rules from examples, and make careful fixes over time.
Why it works: It measures grounded behaviors that matter—order-following, file comprehension, hidden-rule inference, and careful editing—using objective checklists and a strong multimodal judge model aligned with humans. The building blocks are the three task types, file-based deliverables, instance-level rubrics, and an LLM-as-judge that can search the web when facts change.
03 Methodology
🍞 Top Bread (Hook) Imagine your day’s chores turned into missions with clear steps, example documents, and a teacher who checks each requirement one by one. That’s how this benchmark runs agents through real-life tasks.
🥬 Filling (The Actual Concept)
- What it is: A recipe-like pipeline that turns daily tasks into fair tests with files and precise scoring.
- How it works at a high level: Input (task + attachments) → Agent produces outputs (often files) → LLM-as-judge uses rubrics, tools, and web search to verify → Final score.
- Why it matters: To compare agents fairly, we need consistent tasks, realistic files, and objective judging.
🍞 Bottom Bread (Anchor) For a travel plan: the agent must verify facts online, then produce two itineraries. The judge checks that each required verification and output is present.
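To make the input → output → judging flow concrete, here is a minimal Python sketch of what one benchmark instance could look like as a data structure. The field names (task_prompt, attachments, required_deliverables, rubric) are illustrative assumptions for this explainer, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class RubricItem:
    """One yes/no check. kind is 'bonus' (+1 if satisfied) or 'penalty' (-1 if violated)."""
    description: str
    kind: str  # "bonus" or "penalty"

@dataclass
class Task:
    """One AgentIF-OneDay-style instance: instructions, files in, files out, checks."""
    task_prompt: str                                                  # natural-language instructions
    attachments: list[str] = field(default_factory=list)             # PDFs, PPTX, XLSX, images, ...
    required_deliverables: list[str] = field(default_factory=list)   # e.g. ["travel_plans.md"]
    rubric: list[RubricItem] = field(default_factory=list)

# Illustrative example (not a real benchmark task):
travel_task = Task(
    task_prompt=("Verify the conference venue on the official site, cross-check it elsewhere, "
                 "then produce a cheap and a fast travel plan."),
    attachments=["conference_brochure.pdf"],
    required_deliverables=["travel_plans.md"],
    rubric=[
        RubricItem("Venue verified on the official conference site", "bonus"),
        RubricItem("Two distinct itineraries (cheap and fast) provided", "bonus"),
        RubricItem("Contains unverified or fabricated schedule facts", "penalty"),
    ],
)
```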
— New Concept 9 — 🍞 Top Bread (Hook) Think of a referee who can read text, look at pictures, open slides, and even check the internet for up-to-date facts.
🥬 Filling (The Actual Concept)
- What it is: LLM-as-judge is an automated grader (a strong multimodal model) that scores with the rubric and uses tools like web search and file rendering.
- How it works: 1) Read the rubric items, 2) open outputs and attachments, 3) verify each point (yes/no), 4) search the web if facts may have changed, 5) compute the score.
- Why it matters: This allows scalable, consistent grading, closely matching humans (~80% agreement).
🍞 Bottom Bread (Anchor) When grading “Check the official venue,” the judge loads the output, follows included links, and confirms the venue on the official site.
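Below is a rough sketch of how such a judge loop could be wired up, reusing the Task/RubricItem structure from the sketch above. The judge_model.ask client, the render_if_needed helper, and the prompt wording are hypothetical stand-ins; the paper's real judge prompts, rendering tools, and Search Mode are not reproduced here.

```python
def render_if_needed(path: str) -> bytes:
    """Hypothetical helper: render an HTML or PPTX output to an image so the judge
    'sees' layout and style instead of reading raw markup (implementation omitted)."""
    ...

def judge_task(task, agent_outputs, judge_model, allow_search=False):
    """Return one yes/no verdict per rubric item, in rubric order."""
    verdicts = []
    for item in task.rubric:
        prompt = (
            "You are grading an agent's deliverable against one requirement.\n"
            f"Requirement: {item.description}\n"
            "Answer strictly YES or NO."
        )
        # The judge receives the (rendered) deliverables as multimodal context;
        # for time-sensitive facts it may also be allowed to verify via web search.
        answer = judge_model.ask(              # hypothetical client, not a real API
            prompt,
            files=[render_if_needed(p) for p in agent_outputs],
            use_web_search=allow_search,
        )
        verdicts.append(answer.strip().upper().startswith("YES"))
    return verdicts
```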
— New Concept 10 — 🍞 Top Bread (Hook) Imagine grading a poster with a checklist: each box is either checked (+1) or a mistake (-1). No maybes.
🥬 Filling (The Actual Concept)
- What it is: Binary scoring with bonus and penalty items measures success and catches harmful errors separately, then normalizes the total.
- How it works: 1) For each item, mark satisfied or not, 2) sum bonuses, 3) subtract penalties, 4) clamp at zero if needed, 5) divide by the max possible to get a 0–1 score.
- Why it matters: Separating “capability achieved” from “critical mistakes” gives a fairer picture.
🍞 Bottom Bread (Anchor) “Provide two plans” (+1) and “don’t include unverified claims” (-1) can both apply; you add and subtract accordingly for the final grade.
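A minimal sketch of that arithmetic, assuming each rubric item is worth one point and that the normalizer is the total number of bonus items; the paper's exact weighting may differ.

```python
def score_task(verdicts, rubric):
    """verdicts[i] is True when the judge answered YES for rubric[i]:
    for a bonus item that means the requirement was met;
    for a penalty item it means the violation occurred."""
    bonus     = sum(1 for v, item in zip(verdicts, rubric) if item.kind == "bonus" and v)
    penalty   = sum(1 for v, item in zip(verdicts, rubric) if item.kind == "penalty" and v)
    max_bonus = sum(1 for item in rubric if item.kind == "bonus")
    raw = max(bonus - penalty, 0)                      # clamp at zero
    return raw / max_bonus if max_bonus else 0.0       # normalize to a 0-1 score

# Worked example: 5 bonus items, 4 satisfied, 1 penalty triggered
# -> max(4 - 1, 0) / 5 = 0.6
```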
Task design and coverage: The benchmark includes 104 tasks with 767 scoring points. About 59.6% are work-related, 23.1% study, and 17.3% life. It spans many file types: PDFs, PPTX, XLSX, images, HTML, code, and more. Some tasks have multiple attachments (up to 10), testing multi-file reasoning.
Evaluation pipeline details: For HTML outputs, the judge renders pages to avoid misreading raw code. For PPT/HTML, it uses vision-language capabilities to “see” layouts and styles. If a task depends on time-sensitive facts (like conference schedules), the judge enters Search Mode to verify with Google Search. Prompts for evaluation are standardized for consistency.
Data creation: Human annotators first crafted seed tasks with clear, verifiable answers that aren’t easily solvable by quick web searches. Then an automated, file-centered synthesis pipeline expanded the dataset.
— New Concept 11 — 🍞 Top Bread (Hook) Imagine reading a math problem and writing down the step-by-step plan to solve it before you start.
🥬 Filling (The Actual Concept)
- What it is: Workflow Extraction pulls out the logical steps (inputs, outputs, dependencies) from a human-made task.
- How it works: 1) Read the seed task, 2) list steps in order, 3) note inputs/outputs for each step, 4) record which steps depend on earlier ones.
- Why it matters: This blueprint lets us generate many similar tasks in new domains while keeping the same reasoning skeleton.
🍞 Bottom Bread (Anchor) From “Verify venue, cross-check, get dates, check full schedule, then plan travel,” we extract a 5-step workflow used to create new but similar tasks.
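As an illustration, the extracted workflow for that 5-step example could be represented roughly like this; the id/inputs/outputs/depends_on fields are assumptions for the sketch, not the paper's exact format.

```python
# Each step records what it needs, what it produces, and which earlier steps it depends on.
workflow = [
    {"id": 1, "action": "Verify venue on the official conference site",
     "inputs": ["conference name"], "outputs": ["verified venue"], "depends_on": []},
    {"id": 2, "action": "Cross-check the venue on an independent source",
     "inputs": ["verified venue"], "outputs": ["confirmed venue"], "depends_on": [1]},
    {"id": 3, "action": "Get the conference dates",
     "inputs": ["confirmed venue"], "outputs": ["dates"], "depends_on": [2]},
    {"id": 4, "action": "Check the full schedule",
     "inputs": ["dates"], "outputs": ["schedule"], "depends_on": [3]},
    {"id": 5, "action": "Plan travel (cheap and fast options)",
     "inputs": ["confirmed venue", "dates", "schedule"], "outputs": ["two itineraries"],
     "depends_on": [2, 3, 4]},
]
```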
— New Concept 12 — 🍞 Top Bread (Hook) Think of hunting for example documents that are rich and clear so you can design similar problems later.
🥬 Filling (The Actual Concept)
- What it is: Attachment Searching finds realistic files (dashboards, invoices, templates) that fit the workflow’s needs.
- How it works: 1) Generate specific search queries, 2) collect candidate files, 3) analyze their content, 4) select those with the right data and structure.
- Why it matters: Good attachments make tasks real and ensure answers are verifiable.
🍞 Bottom Bread (Anchor) To test cost calculations, the pipeline searches for phone-plan PDFs that list prices, trade-ins, and fees.
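A rough sketch of that search-and-select loop, assuming a hypothetical search_files helper and a simple keyword check; in the real pipeline the query generation and content analysis are model-driven rather than rule-based.

```python
def find_attachments(requirements, search_files, max_results=5):
    """Collect candidate files and keep only those whose extracted text contains
    all the fields the workflow needs (e.g. prices, trade-ins, fees)."""
    queries = [f"{requirements['topic']} {needed} filetype:pdf"
               for needed in requirements["must_contain"]]
    selected = []
    for query in queries:
        for candidate in search_files(query):            # hypothetical search helper
            text = candidate["extracted_text"].lower()
            if all(needed.lower() in text for needed in requirements["must_contain"]):
                selected.append(candidate["path"])
    return selected[:max_results]
```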
— New Concept 13 — 🍞 Top Bread (Hook) Imagine reusing a great lesson plan but swapping in new examples so students face fresh challenges with the same skills.
🥬 Filling (The Actual Concept)
- What it is: Query Generation creates new tasks that follow the same workflow pattern but use different content and domains.
- How it works: 1) Keep the workflow, 2) change the setting and files, 3) ensure outputs are verifiable, 4) match one of the three task types.
- Why it matters: This grows the dataset’s diversity without losing rigor.
🍞 Bottom Bread (Anchor) A travel-planning workflow could become a museum-visit workflow with different dates and cities but the same verify-then-plan logic.
— New Concept 14 — 🍞 Top Bread (Hook) Think of a teacher writing a grading guide so any other teacher can score fairly.
🥬 Filling (The Actual Concept)
- What it is: Rubrics Generation turns the workflow into bonus/penalty checks that are independently verifiable.
- How it works: 1) Add a check for each workflow step, 2) include hidden-rule checks, 3) define penalties for format or harmful changes, 4) specify how to verify each item.
- Why it matters: Clear rubrics power consistent, scalable evaluation.
🍞 Bottom Bread (Anchor) “Include citation markers in the lower-left of PPT slides” becomes a precise, checkable rubric item.
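A minimal sketch of how workflow steps might be turned into bonus/penalty items; the structure and wording are illustrative, and real rubric generation also covers hidden-rule checks and instructions for how each item should be verified.

```python
def rubrics_from_workflow(steps, format_penalties=()):
    """One bonus check per workflow step, plus penalty checks for format or harmful changes."""
    items = [{"check": f"Step {i} completed: {action}", "kind": "bonus"}
             for i, action in enumerate(steps, start=1)]
    items += [{"check": desc, "kind": "penalty"} for desc in format_penalties]
    return items

# Example: the citation-marker requirement becomes a concrete, checkable penalty item.
rubric = rubrics_from_workflow(
    steps=["Verify venue on the official site", "Cross-check elsewhere", "Plan travel"],
    format_penalties=["Citation marker missing from the lower-left corner of any slide"],
)
```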
— New Concept 15 — 🍞 Top Bread (Hook) Before printing a class newsletter, you fix vague parts, add missing dates, and remove anything risky or private.
🥬 Filling (The Actual Concept)
- What it is: Filtering and Rewriting ensures synthetic tasks are complex, safe, time-stable, and verifiable.
- How it works: 1) Require 3+ steps, 2) make answers concrete (files/numbers), 3) replace fuzzy time or place with fixed references, 4) verify sources exist, 5) remove sensitive operations.
- Why it matters: This keeps tasks realistic, safe, and future-proof.
🍞 Bottom Bread (Anchor) “Analyze last week’s prices” becomes “Analyze prices from Jan–Jun 2024,” so the task doesn’t break over time.
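A minimal sketch of the kinds of automated filters described above; the patterns and thresholds are illustrative, and the real pipeline also rewrites borderline tasks (for example, pinning fuzzy dates to a fixed range) rather than only rejecting them.

```python
import re

SENSITIVE = ("delete account", "transfer money", "send email on my behalf")
FUZZY_TIME = re.compile(r"\b(last week|recently|nowadays|next month)\b", re.IGNORECASE)

def passes_filters(task_text: str, num_steps: int, sources_exist: bool) -> bool:
    """Keep only tasks that are complex, time-stable, verifiable, and safe."""
    if num_steps < 3:
        return False                 # require 3+ steps
    if FUZZY_TIME.search(task_text):
        return False                 # should be rewritten to a fixed date range first
    if not sources_exist:
        return False                 # referenced files/sites must actually exist
    if any(s in task_text.lower() for s in SENSITIVE):
        return False                 # drop risky or sensitive operations
    return True
```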
Secret sauce: The combination of file-centered tasks, three complementary task types, binary rubrics with bonus/penalty, and a search-enabled multimodal judge yields human-aligned, objective scores. It measures not just talking well, but actually doing the work correctly in real formats.
04 Experiments & Results
🍞 Top Bread (Hook) Imagine a school tournament where teams build projects from real instructions and files, then a fair judge checks every requirement. Now we can see who actually finishes the job right.
🥬 Filling (The Actual Concept)
- What it is: The authors tested four leading agent systems on AgentIF-OneDay to see how well they follow instructions, handle attachments, and deliver correct files.
- How it works: Each agent got the same 104 tasks with 767 scoring points. Their outputs were graded by the automated judge using binary rubrics, with web search for time-sensitive facts.
- Why it matters: This gives a realistic scoreboard for agent products people might actually use.
🍞 Bottom Bread (Anchor) If an agent says it checked the official conference site, the judge really verifies that link before awarding the point.
The competition: Manus, Genspark, ChatGPT-Agent, and Minimax-Agent were tested in December 2025. Gemini-3-Pro (preview) was used as the judge model with a large context (up to 100k tokens) and low temperature (0.1) to keep judging stable.
The scoreboard (contextualized):
- Manus: 0.645 overall—like a solid A-/B+.
- Genspark: 0.635—also a strong B+/A- territory.
- ChatGPT-Agent: 0.626—close behind, a strong B+.
- Minimax-Agent: 0.562—more like a C+/B-.
The top three formed a tight pack, suggesting many core agent skills are now widely available.
Category strengths:
- By capabilities: Genspark led in Instruction Following (0.766) and tied for best in handling Negative Constraints (0.824). Manus led in Factuality (0.731). Minimax-Agent, despite a lower overall score, led in Logic/Functionality (0.755), hinting at strong reasoning but perhaps slower or less consistent execution.
- With vs. without attachments: Genspark was strongest with attachments (0.691). Manus showed rare robustness, scoring nearly the same with attachments (0.646) as without (0.644).
Domains (work/life/study):
- Work: ChatGPT-Agent ranked first (72.18), then Genspark (71.86), then Manus (70.27).
- Life: Manus first (73.40), then ChatGPT-Agent (69.67), then Genspark (67.85).
- Study: Genspark first (71.19), then Manus (64.41), then ChatGPT-Agent (59.29).
These patterns suggest different product focuses: ChatGPT-Agent is tuned for professional workflows, Manus shines in life tasks, and Genspark excels for learning and study.
Latency (efficiency):
- Genspark (~484 s) and Manus (~500 s) balanced speed and quality.
- Minimax-Agent was slowest (~1416 s); the delay may come from its heavy reasoning style (it led Logic/Functionality) but it hurts overall throughput.
Judge agreement with humans: On a 28-problem, 171-criteria set, Gemini-3-Pro-preview reached about 80.1% agreement with human graders, outperforming Gemini-2.5-Pro (73.9%) and GPT-5.1 (63.8%). This level is strong for automation but still leaves room for disagreements on fuzzy concepts like “conciseness” or “design sense.”
Surprising findings:
- Performance parity: API-based agents (built mainly with prompts and tools) can match custom RL-based agents on this benchmark. This hints that agentic capabilities are increasingly built into modern LLMs.
- Hidden-rule inference is the toughest: Many agents struggled to perfectly copy formatting or pricing logic inferred from attachments—sometimes getting content right but missing style details (like citation markers or slide layout features).
- Robustness to attachments: Manus maintaining nearly identical scores with and without attachments is unusual and valuable; it means its performance doesn’t collapse when files enter the scene.
Case studies show the nuances: In a DeepMind-in-Nature PPT task, ChatGPT-Agent kept the page style better but missed listing enough articles; Genspark missed a citation marker and added irrelevant details. In a golf-driver shopping task, ChatGPT-Agent hit the price constraint but missed visual style and some specs; Genspark better matched the background style and specs, showing stronger cross-modal reasoning.
Overall, the benchmark surfaces meaningful product differences that matter to real users: who follows steps best, who respects negative constraints, who stays factual, and who can juggle files without stumbling.
05 Discussion & Limitations
🍞 Top Bread (Hook) Imagine grading a science fair: even with good rules, there are tricky parts—some projects need special tools to judge, and not all judges agree on style.
🥬 Filling (The Actual Concept)
- Limitations: Creating rich, verifiable tasks is expensive (about three hours per task), and individual annotators run out of fresh, realistic scenarios quickly. Daily-life topics vary so widely that it’s hard to find experts to verify everything. Automated synthesis helps scale but can’t fully replace high-quality human design. Automated judging, while strong, still disagrees with humans on fuzzy notions like “conciseness” or “design sense,” and time-sensitive facts require careful web verification.
- Required resources: Using this benchmark needs an agent that can read and write many file types, handle long contexts, and optionally use tools like web browsers. For scoring, a capable multimodal LLM judge with large context and safe web access is ideal.
- When not to use: If you only care about a single vertical (say, just coding or just math) or need open-ended creativity without verifiable outputs, a broad, file-centered benchmark with binary rubrics may be overkill. Also, if you can’t support attachments, you won’t exercise much of what this benchmark tests.
- Open questions: How can we make hidden-rule inference more reliable? What’s the best way to model long-horizon memory and state so iterative edits never “forget” prior constraints? Can we reach >90% judge-human agreement for style-sensitive checks? How do we expand from “OneDay” to “OneWeek” without drifting into domain-specific hand-tuning? And how can we best turn these rubrics into reinforcement learning signals to steadily lift real-world reliability?
🍞 Bottom Bread (Anchor) Think of the next version as a week-long group project: more files, more steps, more check-ins—demanding better memory, better hidden-rule understanding, and even fairer judging tools.
06 Conclusion & Future Work
🍞 Top Bread (Hook) Imagine hiring an assistant for a busy day and handing them your files. You only keep them if they follow every step, respect your examples, and fix things exactly as you ask.
🥬 Filling (The Actual Concept)
- 3-Sentence Summary: AgentIF-OneDay is a task-level, file-centered benchmark that tests if AI agents can follow real instructions in daily scenarios across three task types: open workflows, latent instruction inference, and iterative refinement. It judges results with tiny, objective checklist items and a strong multimodal LLM-as-judge aligned with human graders. Experiments show top agents cluster closely, implying agentic skills are increasingly built into modern LLMs, but hidden-rule inference and long-horizon consistency remain tough.
- Main Achievement: Turning everyday, file-based tasks into a fair, scalable, human-aligned test that measures whether agents actually finish real work correctly—not just talk about it.
- Future Directions: Extend to “OneWeek” time horizons, boost hidden-rule inference (from formats to pricing logic), strengthen memory for multi-turn edits, and convert rubrics into better training signals for reinforcement learning. Improving judge models to surpass 90% agreement on style-sensitive checks is another key goal.
- Why Remember This: It shifts evaluation from clever chat to dependable completion—verifying that agents can read your files, follow your steps, apply your examples’ hidden rules, and refine work over time. That’s the path from impressive demos to trustworthy daily helpers.
🍞 Bottom Bread (Anchor) If your AI can plan your trip by verifying sources, copy your slide style from an example, and then polish the deck after your comments—this benchmark says, “Yes, it can.”
Practical Applications
- Evaluate your in-house AI assistant’s ability to follow multi-step company workflows with attached documents.
- Benchmark different vendor agents on your everyday scenarios (travel planning, report updates, spreadsheet edits) before buying.
- Use the rubric design to create internal checklists for automated quality control of AI outputs.
- Train agents with reinforcement learning using these instance-level rubrics as reward signals.
- Stress-test an agent’s hidden-rule inference by giving it style templates (PPT/HTML) and checking format fidelity.
- Measure how well an agent maintains state across iterative edits in multi-turn collaboration.
- Assess robustness to attachments by mixing PDFs, PPTX, XLSX, images, and HTML deliverables.
- Automate regression testing for new agent releases with consistent, objective scoring.
- Identify product strengths (e.g., factuality vs. instruction following) and target improvements.
- Prototype a “OneWeek” extension to test longer, real-world projects with the same methodology.