DeepResearch agents write long, evidence-based reports, but teaching and grading them is hard because there is no single 'right answer' to score against.
This paper shows that many AI models that can both read and create images are not truly unified inside: they often understand images well but fail to generate them (or the other way around).
Big models are often used to grade AI answers, but they are expensive, slow, and rely too heavily on carefully crafted prompts.
CAR-bench is a new 'driving test' for AI assistants that checks if they can stay careful, honest, and consistent during real back-and-forth conversations in a car.
This paper introduces MMDeepResearch-Bench (MMDR-Bench), a new test that checks how well AI 'deep research agents' write long, citation-rich reports using both text and images.
KnowMe-Bench is a new test that checks if AI helpers truly understand a person, not just remember facts.
SmartSnap teaches an agent not only to finish a phone task but also to prove it did so with a few well-chosen snapshots it selects itself.
Big vision-language models are super smart but too large to fit on phones and small devices.
This paper asks whether we are judging AI answers the right way and introduces Sage, a new way to test AI judges without using human-graded answers.