DeepResearch agents write long, evidence-based reports, but teaching and grading them are hard because there is no single 'right answer' to score against.
This paper shows that many AI models that both understand and generate images are not truly unified inside: they often understand well but fail to generate (or the other way around).
This paper builds a live challenge that tests how well Deep Research Agents (DRAs) can write expert-level Wikipedia-style articles.
Big models are often used to grade AI answers, but they are expensive, slow, and depend too much on tricky prompts.
CAR-bench is a new 'driving test' for AI assistants that checks if they can stay careful, honest, and consistent during real back-and-forth conversations in a car.
This paper introduces MMDeepResearch-Bench (MMDR-Bench), a new test that checks how well AI 'deep research agents' write long, citation-rich reports using both text and images.
This survey explains how AI judges are changing from single smart readers (LLM-as-a-Judge) into full-on agents that can plan, use tools, remember, and work in teams (Agent-as-a-Judge).
KnowMe-Bench is a new test that checks if AI helpers truly understand a person, not just remember facts.
This paper fixes a common problem in multimodal AI: models can understand pictures and words well but stumble when asked to create matching images.
This paper teaches AI helpers to browse the web more like people do, not just by grabbing static snippets.
SmartSnap teaches an agent not only to finish a phone task but also to prove it with a few perfect snapshots it picks itself.
Big vision-language models are super smart but too large to fit on phones and small devices.