The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in Workplace Scenarios
Key Summary
- The paper introduces Trainee-Bench, a new way to test AI agents that feels like a real first day at work, with tasks arriving over time, hidden clues, and changing priorities.
- It focuses on three hard skills: scheduling many tasks with deadlines, actively exploring to find missing info instead of guessing, and learning from past mistakes so errors don't repeat.
- Unlike old benchmarks that are static and fully visible, Trainee-Bench randomizes details, hides key clues, and links tasks across time to prevent memorization and force real reasoning.
- Top models still struggle: the best model reached only a 35% success rate, showing today's agents are far from reliable in dynamic workplaces.
- Agents fail most often on "hard" tasks that require finding hidden info and doing precise calculations; performance drops further as more tasks overlap.
- Continual learning was mixed: using yesterday's notes sometimes helped on hard tasks but actually hurt on easy ones due to mismatched experiences.
- Human hints made a huge difference, boosting average scores from 0.24 to 0.83 on difficult tasks, showing that current agents need help to explore and plan well.
- The benchmark provides step-by-step checkpoints and feedback so we can measure partial progress and analyze where agents break.
- This work moves evaluation from clean lab tests to messy, realistic settings and sets a roadmap for building more dependable AI coworkers.
Why This Research Matters
Real jobs are messy: tasks arrive unpredictably, information lives in different apps and people's heads, and deadlines collide. Trainee-Bench tests AI agents in that real mess, so we can trust them with important work like scheduling, reporting, and planning. By rewarding exploration and step-by-step progress, it encourages agents to ask smart questions instead of guessing. The findings show big gaps, especially in hard tasks and self-improvement, so companies know where AI still needs human oversight. The benchmark's feedback helps designers build agents that learn from experience and avoid repeating mistakes over time. As these agents improve, they can safely handle more of the busywork, freeing people to focus on creative and human-centered tasks.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine it's your first day as a school helper. Teachers give you jobs at random times: put up posters, sort books, and run to the office before the bell. Some instructions are missing, so you must ask questions or look around. You also need to remember what you learned yesterday so you don't repeat mistakes.
🥬 The World Before: For years, AI models, especially multimodal large language models (MLLMs), got really good at solving fixed problems where everything was known in advance. Benchmarks usually gave a single, clear task with all the info provided, and the AI solved it in one go. This proved best-case skill, but not reliability when life gets noisy and tasks overlap.
🍞 Anchor: Think of a math worksheet with all problems on one page and plenty of time. That's what many old benchmarks felt like: orderly and complete.
🍞 Hook: You know how class can get chaotic? A friend needs help, the bell rings, and a teacher changes the plan. You juggle it all by planning, asking smart questions, and remembering what worked before.
🥬 The Problem: Real workplaces are dynamic. Tasks arrive as a stream, deadlines collide, crucial info is hidden in files or with coworkers, and new tasks appear mid-meeting. Today's agents often hallucinate (confidently guess), forget context after switching tasks, and fail to reuse experience.
🍞 Anchor: It's like being told "Plan Friday's event," but the event date is in a calendar on the wall, the budget is in a spreadsheet, and the room is booked unless you check first.
🍞 Hook: Picture a chef who has to cook multiple dishes while deliveries arrive late and recipes are missing. They must reorder steps, ask the supplier, and remember last time's fix.
🥬 Failed Attempts: Many prior agent tests were static, fully visible, and one-shot. Agents looked good when everything was provided, but broke when info was missing or when a second task interrupted the first. Tool-use tests helped, but most still didn't measure exploration (asking, looking) or learning across days.
🍞 Anchor: Practicing only on practice tests with answer keys doesn't teach you how to handle pop quizzes with missing pages.
🍞 Hook: You know how coaches make drills that feel like real games (timers, defenders, and noise) so players can perform under pressure?
🥬 The Gap: We needed a benchmark that (1) streams tasks over time with priorities and deadlines, (2) hides necessary clues so the agent must explore and ask, and (3) repeats patterned tasks across days so the agent can learn general strategies, not just memorize examples.
🍞 Anchor: It's like a school day simulator: announcements interrupt you, some instructions are on posters in the hallway, and you compare yesterday's mistakes to do better today.
New Concepts
🍞 Hook: Imagine a super student who can read text, look at pictures, and talk. 🥬 Multi-modal Large Language Models (MLLMs): What: MLLMs are AIs that understand and generate multiple kinds of data (like text and images) and can use tools. How: They read inputs (text/images), plan steps, call tools (e.g., calendars, files), and produce answers. Why: Without MLLMs, agents can't handle real offices where info is in emails, images, and apps. 🍞 Anchor: Like a student who reads a memo, checks a map, and sends an email.
🍞 Hook: You know how teachers give pop-up chores during class? 🥬 Dynamic Task Scheduling: What: Organizing incoming tasks on the fly by priority and deadline. How: Track all tasks → pick what matters now → pause and resume as things change. Why: Without it, agents miss deadlines and lose context when interrupted. 🍞 Anchor: Choosing to grab the attendance sheet now because the bell rings in 2 minutes.
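To make the "track → pick → pause → resume" idea concrete, here is a minimal Python sketch; the task fields, numbers, and names are illustrative assumptions, not taken from the paper:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    deadline: int                     # minutes from now; smaller = more urgent
    priority: int                     # tie-breaker; smaller = more important
    name: str = field(compare=False)  # label, ignored when ordering

def next_task(queue):
    """Pop the most urgent task (earliest deadline, then highest priority)."""
    return heapq.heappop(queue) if queue else None

queue = []
heapq.heappush(queue, Task(120, 2, "sort books"))
heapq.heappush(queue, Task(2, 1, "grab attendance sheet"))

current = next_task(queue)                      # bell rings in 2 minutes, so this wins
heapq.heappush(queue, current)                  # an interrupt arrives: pause and re-queue it
heapq.heappush(queue, Task(0, 0, "urgent meeting"))
print(next_task(queue).name)                    # the urgent meeting is handled first
```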
🍞 Hook: When you don't know an answer, you raise your hand or look it up. 🥬 Active Exploration: What: Purposefully seeking missing info (ask, search, open files) instead of guessing. How: Notice uncertainty → plan questions → use tools/NPCs → verify clues. Why: Without it, agents hallucinate and break tasks. 🍞 Anchor: Asking the librarian where the science books are before writing a report.
🍞 Hook: Remember how you improve at a game by learning patterns? 🥬 Continual Learning: What: Learning over time without forgetting, turning experience into general strategies. How: Get feedback → reflect/summarize → store lessons → apply next time. Why: Without it, agents repeat the same mistakes. 🍞 Anchor: After missing the bus yesterday, you leave earlier today.
02 Core Idea
🍞 Hook: Imagine a realistic "first day at work" simulator where jobs arrive during the day, key info is hidden in files or with coworkers, and meetings interrupt your plans.
🥬 Aha! Moment (in one sentence): Test agents the way real work happens (streamed, partially hidden, and changing) so we measure scheduling, exploration, and learning, not just final answers.
🍞 Anchor: It's like a school day simulation that checks if you can plan, ask for missing info, and get better tomorrow.
Three Analogies
- Theme park day: New rides open, lines change, and a friend holds the map. Great performance means reordering plans, asking staff, and remembering shortcuts.
- Kitchen rush: Orders stream in, ingredients are missing, and ovens ping mid-recipe. You reprioritize, call the supplier, and reuse last week's prep tricks.
- Detective case: Clues are scattered, suspects appear mid-interview, and you must recall patterns from old cases.
Before vs After
- Before: Benchmarks were like neat worksheets: single-shot, all info given.
- After: Trainee-Bench is like a live school day: tasks stream in, some info is hidden, and you're graded along the way.
Why It Works (intuition)
- Randomization prevents memorizing shortcuts.
- Hidden clues force active exploration instead of guessing.
- Timed, overlapping tasks reveal if agents can switch context and keep track of goals.
- Checkpoints show partial progress and give feedback for learning.
Building Blocks (with Sandwiches)
🍞 Hook: Think of "game levels" that produce countless unique missions. 🥬 Meta-tasks (rule-based templates): What: Abstract task blueprints (e.g., "plan a meeting") that can generate many concrete jobs. How: Fill templates with randomized people, files, and data to create fresh instances. Why: Without templates, agents could memorize fixed tasks. 🍞 Anchor: A "write a letter" prompt that changes recipient, topic, and deadline each time.
🍞 Hook: Like rolling different dice each round. 🥬 Randomization Function: What: A controlled generator that varies names, file paths, numbers, and schedules. How: Use a seed to create NPC profiles and environment data; plug them into rules. Why: Without randomness, evaluations repeat and don't test generalization. 🍞 Anchor: Same recipe, new ingredients and cooking times.
🍞 Hook: Sometimes the instruction sheet is missing a page. 🥬 Partial Observability: What: The task goal is given, but key clues (manuals, passwords, instructions) are hidden. How: The agent must ask NPCs, open folders, and search to reveal clues. Why: Without this, agents just follow the prompt and never learn to explore. 🍞 Anchor: You must ask the office manager for the Wi-Fi password before you can send the report.
🍞 Hook: Picture a timeline where alarms occasionally force you to switch tasks. 🥬 Dynamic Composite Scenarios: What: Multiple tasks stitched into a day with deadlines, interrupts, and dependencies. How: Assemble task instances along time; ensure names don't clash; add meetings that preempt work; reveal new tasks during old ones. Why: Without composition, agents aren't tested on juggling and adapting. 🍞 Anchor: Stop writing to attend homeroom, then resume exactly where you left off.
🍞 Hook: Instead of only checking the final grade, imagine getting credit for each correct step. 🥬 Checkpoints & Automated Verification: What: Milestones that can be automatically checked to score progress. How: Each task has verifiable steps; passing steps increases the score and triggers feedback. Why: Without checkpoints, we can't see where agents stumble or learn from mistakes. 🍞 Anchor: You get points for finding the right date, more points for booking the room, and feedback if you forgot snacks.
🍞 Hook: Think of three players passing the ball. 🥬 Interaction Protocol (Environment-Agent-MLLM): What: A standard loop where the environment offers tools/tasks, the agent plans, and the MLLM produces actions. How: Maintain history → call model → parse tool calls → update state → repeat. Why: Without a clear protocol, different agents can't be compared fairly. 🍞 Anchor: Teacher (environment) posts tasks, student (agent) plans, calculator (MLLM) computes, then the class list updates.
03 Methodology
At a high level: Input (rule library + random seed + tool APIs) → Step A: Instantiate meta-tasks → Step B: Compose a daily scenario with timelines and priorities → Step C: Run the interaction loop (agent → MLLM → environment) → Step D: Verify with checkpoints and produce feedback → Output: Scores, traces, and experience for learning.
Step A - Task Instantiation (What happens):
- Pick a meta-task rule (e.g., "Event Planning").
- Use the randomization function to generate NPCs (names, roles), files (paths, data tables), and constraints.
- Produce a natural-language task description, a hidden set of clues (e.g., where the manual is), and a list of checkpoints. Why it exists: Prevents agents from memorizing fixed answers; forces general reasoning. Example: "Plan a team-building day under $2,000. Availability is hidden in calendar.csv; travel times in map.json." Clues (hidden): "Check CloudDisk:/events/manual.md for scoring rules." A minimal instantiation sketch follows after this list.
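As a rough illustration of Step A, the sketch below fills a hypothetical meta-task template with seeded random details; the field names, file paths, and checkpoint labels are assumptions for illustration, not the benchmark's actual rule format:

```python
import random

# Hypothetical meta-task rule; the structure is a stand-in for the paper's rule library.
META_TASK = {
    "rule": "Event Planning",
    "template": "Plan a team-building day under ${budget}. Availability is hidden in {calendar}.",
    "hidden_clues": ["Check {manual} for scoring rules."],
    "checkpoints": ["found_valid_date", "booked_room", "sent_confirmation"],
}

def instantiate(meta, seed):
    """Fill the template with seeded random details so every run looks different."""
    rng = random.Random(seed)
    slots = {
        "budget": rng.choice([1500, 2000, 2500]),
        "calendar": f"CloudDisk:/events/calendar_{rng.randint(1, 9)}.csv",
        "manual": "CloudDisk:/events/manual.md",
    }
    return {
        "description": meta["template"].format(**slots),
        "hidden_clues": [clue.format(**slots) for clue in meta["hidden_clues"]],
        "checkpoints": list(meta["checkpoints"]),
    }

task = instantiate(META_TASK, seed=42)
print(task["description"])   # same rule, fresh numbers and paths for each seed
print(task["hidden_clues"])  # clues stay hidden until the agent explores
```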
Step B - Build a Dynamic Composite Scenario (What happens):
- Combine 2 to 6 task instances into one timeline.
- Enforce conflict resolution (unique names/IDs) and define conditional triggers for NPCs who guard multiple clues.
- Add temporal prioritization (e.g., a 10:30 meeting that preempts current work).
- Insert inter-task dependencies (a new assignment revealed mid-meeting). Why it exists: Tests scheduling, context switching, and plan updating under uncertainty. Example: While drafting an ad plan, an urgent meeting pops up; during that meeting, a manager announces a new budget constraint that changes the plan. A small composition sketch follows after this list.
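A toy sketch of Step B's composition logic, assuming a simple event-list representation; the times, payload strings, and uniqueness check are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Event:
    time: str       # "HH:MM" release or trigger time
    kind: str       # "task", "interrupt", or "reveal"
    payload: str

def compose_scenario(instances):
    """Lay task instances on a timeline, add a preempting meeting,
    and hide a new assignment inside that meeting."""
    timeline = [Event(t, "task", desc) for t, desc in instances]
    timeline.append(Event("10:30", "interrupt", "Mandatory budget meeting"))
    timeline.append(Event("10:45", "reveal", "New constraint: budget cut to $1,500"))
    # Names/descriptions must stay unique so clues from different tasks never collide.
    assert len({e.payload for e in timeline}) == len(timeline)
    return sorted(timeline, key=lambda e: e.time)

scenario = compose_scenario([("09:00", "Draft the ad plan"), ("11:00", "Plan team-building day")])
for e in scenario:
    print(e.time, e.kind, e.payload)
```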
Step C - Interaction Protocol (What happens):
- The environment provides: system prompt, tools (OpenFolderInCloudDisk, SendMessage, CalendarAPI), and current time.
- The agent maintains history, queries the MLLM for the next action, and executes tool calls.
- Partial observability means the agent must explore: open folders, ask NPCs, and verify results. Why it exists: Standardizes how agents act and think so comparisons are fair. Example with data: The agent calls OpenFolderInCloudDisk({path: "CloudDisk:/ads_strategy/"}) and finds ads_strategy_handbook.md; then messages Sarah (NPC) to confirm targeting rules. A bare-bones version of this loop is sketched below.
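A bare-bones version of the Step C loop, with a scripted stand-in for the model call and toy tool implementations; only the tool names come from the paper, everything else here is a placeholder:

```python
import json

def open_folder(path):
    # Toy stand-in for the benchmark's OpenFolderInCloudDisk tool.
    return {"files": ["ads_strategy_handbook.md"]} if "ads_strategy" in path else {"files": []}

def send_message(to, text):
    # Toy stand-in for SendMessage; real NPC replies come from the environment.
    return {"reply": f"{to}: got it."}

TOOLS = {"OpenFolderInCloudDisk": open_folder, "SendMessage": send_message}

def call_mllm(history):
    # Placeholder for the real model call; here we simply script two actions.
    step = sum(1 for h in history if h["role"] == "tool")
    if step == 0:
        return '{"tool": "OpenFolderInCloudDisk", "args": {"path": "CloudDisk:/ads_strategy/"}}'
    return '{"tool": "SendMessage", "args": {"to": "Sarah", "text": "Confirm targeting rules?"}}'

history = [{"role": "system", "content": "You are an intern agent."}]
for _ in range(2):   # maintain history -> call model -> parse tool call -> execute -> update
    action = json.loads(call_mllm(history))
    result = TOOLS[action["tool"]](**action["args"])
    history.append({"role": "tool", "content": json.dumps(result)})
print(history[-2:])
```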
Step D - Automated Verification & Feedback (What happens):
- Each task has checkpoints like: "Found valid date," "Booked room," "Calculated exposure correctly," "Sent confirmation email."
- The system auto-checks passed steps and normalizes to a scenario score.
- Each checkpoint emits natural-language feedback for missed items. Why it exists: Measures partial progress and provides learning signals. Example: Feedback says, "You didn't request the authorization code from the manager," guiding the agent next time. A toy verifier is sketched after this list.
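One plausible shape for a Step D verifier, sketched with made-up checkpoint names and checks; the real verifiers are task-specific:

```python
def verify(trace, checkpoints):
    """Check which milestone predicates pass, normalize to a score,
    and emit natural-language feedback for the misses."""
    passed, feedback = [], []
    for name, check, hint in checkpoints:
        if check(trace):
            passed.append(name)
        else:
            feedback.append(hint)
    return len(passed) / len(checkpoints), feedback

trace = {"date_found": True, "room_booked": True, "auth_code_requested": False}
checkpoints = [
    ("found_valid_date", lambda t: t["date_found"], "You never located a valid date."),
    ("booked_room",      lambda t: t["room_booked"], "The room was never booked."),
    ("got_auth_code",    lambda t: t["auth_code_requested"],
     "You didn't request the authorization code from the manager."),
]
score, feedback = verify(trace, checkpoints)
print(score, feedback)   # ~0.67 plus the missing-authorization hint
```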
Step E - Continual Learning Flow (Optional):
- Day 1: Run the scenario, collect feedback, and summarize lessons (e.g., "Always check CloudDisk:/manuals_for_intern.md before planning").
- Day 2: Similar tasks are re-instantiated with different details; the agent uses Day 1 lessons to adapt. Why it exists: Tests whether agents extract general strategies, not just memorize. Example: Yesterday the agent missed "ask HR who maintains the website"; today it proactively asks HR at the start. A minimal lesson-store sketch follows after this list.
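One way Step E could be wired, shown as a hedged sketch; the lesson format and the trigger-tag matching rule are assumptions for illustration, not the MUSE framework's actual design:

```python
LESSON_STORE = []   # persists across simulated days

def summarize_day(feedback_items, task_type):
    """Day 1: turn checkpoint feedback into short, tagged lessons."""
    for item in feedback_items:
        LESSON_STORE.append({"trigger": task_type, "lesson": item})

def recall(task_type):
    """Day 2: fetch lessons whose tag matches today's re-randomized task."""
    return [entry["lesson"] for entry in LESSON_STORE if entry["trigger"] == task_type]

summarize_day(["Ask HR who maintains the website before editing it."], task_type="website_update")
print(recall("website_update"))   # surfaced at the start of Day 2
```

If Day 2 fails at a different checkpoint after randomization, a stored lesson like this may not apply, which matches the mismatch effect reported in the experiments section.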
The Secret Sauce:
- Partial observability + randomization make guessing risky and exploration necessary.
- Composite timelines stress-test scheduling and memory.
- Checkpoints provide dense, actionable feedback that fuels continual learning.
Concrete Mini-Case - Advertising Campaign Planning:
- Input: Heatmap images, channels.csv with cost/exposure data, and a budget.
- Agent must: Identify high-density areas, pick channels to maximize exposure under budget.
- Hidden clue: "Treat as a knapsack optimization; use the provided density matrix instead of inferring from images." (A minimal knapsack sketch follows after this list.)
- What breaks without steps: If the agent skips exploration, it may misread data; if it canāt schedule, it misses the budget meeting; without learning, it repeats cost calculation mistakes.
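The hidden clue frames channel selection as a 0/1 knapsack problem: maximize total exposure subject to the budget. Here is a minimal dynamic-programming sketch with invented channel numbers; the real costs and exposures would come from channels.csv:

```python
def pick_channels(channels, budget):
    """0/1 knapsack over advertising channels: maximize exposure under budget.
    Costs are whole dollars so the DP table can be indexed by remaining budget."""
    best = [0] * (budget + 1)                 # best[b] = max exposure achievable with budget b
    choice = [[] for _ in range(budget + 1)]  # choice[b] = channels behind best[b]
    for name, cost, exposure in channels:
        for b in range(budget, cost - 1, -1):
            if best[b - cost] + exposure > best[b]:
                best[b] = best[b - cost] + exposure
                choice[b] = choice[b - cost] + [name]
    return best[budget], choice[budget]

channels = [("subway_ads", 800, 120_000), ("radio", 500, 60_000), ("flyers", 300, 25_000)]
print(pick_channels(channels, budget=1000))
# -> (120000, ['subway_ads']): subway alone beats radio + flyers under a $1,000 budget
```

A checkpoint in the style of "Calculated exposure correctly" is what catches agents that eyeball the heatmap instead of using the provided density matrix.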
Concrete Mini-Case - Event Planning:
- Input: availability.csv, locations.json, map distances.
- Agent must: Choose a valid date, build an itinerary, and compute distances/times within constraints (a toy check is sketched after this list).
- Hidden clue: Valid date range is buried in the shared calendar; scoring rules are in a manual file.
- What breaks: Missing the valid date invalidates the whole plan; failing to verify distances leads to wrong scores.
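A toy version of the two steps that most often break here: checking a date against everyone's availability and verifying total travel time. The dates, distances, and the 60-minute limit below are invented for illustration:

```python
from datetime import date

availability = {                       # would normally be parsed from availability.csv
    "Alice": {date(2025, 6, 6), date(2025, 6, 13)},
    "Bob":   {date(2025, 6, 13), date(2025, 6, 20)},
}
travel_minutes = {("office", "park"): 25, ("park", "restaurant"): 15}   # from map distances

def valid_dates(avail):
    """Dates on which every attendee is free; missing this invalidates the whole plan."""
    people = list(avail.values())
    return set.intersection(*people) if people else set()

def itinerary_ok(stops, limit_minutes=60):
    """Verify total travel time from the data instead of guessing it from a map image."""
    total = sum(travel_minutes[(a, b)] for a, b in zip(stops, stops[1:]))
    return total, total <= limit_minutes

print(valid_dates(availability))                       # -> {datetime.date(2025, 6, 13)}
print(itinerary_ok(["office", "park", "restaurant"]))  # -> (40, True)
```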
04 Experiments & Results
The Test: Researchers built 50 dynamic scenarios, each with 2 to 6 tasks. They measured four things: (1) Success Rate (percentage of tasks finished), (2) Checkpoint Score (how many steps were correctly done), (3) Average Steps (depth of interaction), and (4) Average Tool Calls (tool-use effort). Models followed the same protocol and got summarized history if the context got too long.
The Competition: Seven modern models were tested, including Gemini-3-Flash, Claude-4-Sonnet, GPT-5.1, Grok-4, GPT-4o, Qwen3-VL-235B-A22B, and Llama-4-maverick.
The Scoreboard (with context):
- Gemini-3-Flash led with a 35% Success Rate and a 0.639 Checkpoint Score (like getting a strong B when most are at C levels) while also taking more steps (90) and tool calls (232), showing that thorough exploration helps.
- Claude-4-Sonnet reached a 23% Success Rate and a 0.593 Checkpoint Score: solid, but still far from workplace reliability.
- GPT-4o and Grok-4 performed moderately; Qwen3-VL was lower; Llama-4-maverick struggled due to tool-calling errors and instruction-following issues.
Workload Matters: As the number of concurrent tasks rose from 2 to 6, models like Gemini-3-Flash, Grok-4, and GPT-4o declined in Success Rate (e.g., Gemini-3-Flash from 50% to 36%), revealing how context switching and temporal uncertainty challenge agents.
Task Difficulty Hits Hard: Splitting tasks into Easy vs Hard showed sharp drops on Hard tasks. For example, Grok-4 fell from 34% to 7% Success Rate. Gemini-3-Flash still held 32% on Hard tasks, showing better robustness in complex reasoning and exploration but still leaving much room for improvement.
Sandwich: Metrics. 🍞 Hook: Think of grading both your final answer and each step you got right. 🥬 What/How/Why: Success Rate is the share of fully solved tasks; Checkpoint Score is the fraction of steps passed, rewarding partial progress; Steps and Tool Calls show how much effort/exploration the agent invested. Without these, we can't tell whether an agent barely tried or carefully explored. 🍞 Anchor: You might not finish the whole project, but you still get credit for picking the right date and booking the room.
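For readers who want the arithmetic, here is one plausible way to compute the two headline metrics from raw per-task results; the field names and exact normalization are assumptions and may differ from the paper's implementation:

```python
def score_run(tasks):
    """tasks: list of dicts like {"checkpoints_passed": 3, "checkpoints_total": 5, "solved": False}."""
    success_rate = sum(t["solved"] for t in tasks) / len(tasks)
    checkpoint_score = sum(t["checkpoints_passed"] / t["checkpoints_total"] for t in tasks) / len(tasks)
    return success_rate, checkpoint_score

run = [
    {"checkpoints_passed": 5, "checkpoints_total": 5, "solved": True},
    {"checkpoints_passed": 2, "checkpoints_total": 4, "solved": False},
]
print(score_run(run))   # (0.5, 0.75): partial progress still earns checkpoint credit
```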
Continual Learning Results: Using a leading framework (MUSE) with GPT-4o, Day 2 sometimes got worse overall (Checkpoint Score dropped from 0.42 to 0.36). Why? Because the feedback from Day 1 targeted one missed checkpoint, but Day 2 failed at a different step after randomization, so yesterday's tip didn't match today's need. Interestingly, hard tasks improved slightly (0.20 → 0.24) because Day 1 produced richer lessons, while easy tasks got worse (0.61 → 0.50) due to misleading or sparse experience.
Human Guidance Helps a Lot: Tiered hints on hard tasks boosted average scores from 0.24 (no hints) to 0.83 with three levels of guidance, like moving from a D to an A-. Multiple rounds of self-evolution alone added only +0.04. This exposes current limits in autonomous exploration and precise execution.
Surprises:
- More steps and tool calls correlated with better outcomes for the top model: effortful exploration pays off.
- Continual learning can hurt easy tasks if experiences donāt generalize.
- Even with strong reasoning, precise calculations (e.g., travel distances, budget exposure) are brittle without the right data and verification.
05 Discussion & Limitations
Limitations:
- Scenario complexity: Current task compositions don't yet include deep causal chains; future versions will add richer dependencies.
- Manual rule authoring: Meta-task rules are handcrafted, which limits scalability; automated rule generation would expand coverage.
- Scope: Experiments are focused on office-like settings and a subset of agent frameworks.
Resources Needed:
- Access to strong MLLMs or APIs, a tool-execution sandbox, and enough compute for long interaction traces.
- Logging and storage for histories, feedback, and experience summaries.
When Not to Use:
- If you only need to test single, fully-specified tasks with no interruptions.
- If your agent cannot call tools or interact with files/NPCs; Trainee-Bench requires exploration.
- If you seek pure final-answer accuracy without process/step analysis.
Open Questions:
- How can agents decide when to explore vs when to act to minimize wasted steps?
- What experience formats (rules, skills, memories) transfer best across randomized tasks?
- Can we design self-supervised signals that replace human hints but still drive big gains?
- How should agents manage long-term memory to survive frequent context switches?
- Which curricula of tasks accelerate robust scheduling and exploration skills?
06 Conclusion & Future Work
3-Sentence Summary: Trainee-Bench tests AI agents in realistic, dynamic workplace scenarios where tasks stream in, key clues are hidden, and priorities change. It measures three vital abilities (scheduling, active exploration, and continual learning) using randomized meta-tasks, composite timelines, and checkpoint-based feedback. Results show that even top models struggle, especially on hard tasks and autonomous learning, while human hints dramatically improve performance.
Main Achievement: Shifting evaluation from neat, static tests to a production-style environment with partial observability and temporal complexity, plus fine-grained, feedback-rich scoring.
Future Directions: Add deeper causal chains and broader domains, automate rule generation, refine learning protocols that generalize across randomization, and develop exploration strategies that balance curiosity with efficiency.
Why Remember This: If we want dependable AI coworkers, we must train and test them in the messy reality they'll face, where information is missing, clocks are ticking, and yesterday's lessons must guide today's choices.
Practical Applications
- Evaluate in-house AI assistants on realistic, multi-task office days before deploying them to staff calendars and email.
- Train agents to always search manuals and message the right person when info is missing, reducing hallucinations.
- Use checkpoint feedback to build checklists and playbooks that agents follow to avoid repeated errors.
- Benchmark different LLM backends and prompting strategies on the same dynamic scenarios to choose the most reliable stack.
- Stress-test scheduling logic with timed meetings and interrupts to improve calendar and ticket-triage bots.
- Design tiered-hint workflows where human supervisors nudge stuck agents, maximizing performance with minimal intervention.
- Create curricula of meta-tasks to incrementally teach agents general strategies (e.g., "verify data before computing").
- Instrument tool calling (files, messaging, calendars) and analyze traces to fix brittle steps like parameter formatting.
- Adopt memory modules that store high-value lessons and auto-apply them when similar tasks recur.
- Run A/B tests on exploration strategies (ask an NPC vs. search files first) to minimize wasted steps and timeouts.