This paper tackles a common problem in multimodal AI: models that understand pictures and words well often stumble when asked to create matching images.
This paper teaches AI helpers to browse the web more like people do, not just by grabbing static snippets.
SmartSnap teaches an agent not only to finish a phone task but also to prove it with a few perfect snapshots it picks itself.
Big vision-language models are super smart but too large to run on phones and other small devices.
This paper asks whether we are judging AI answers the right way and introduces Sage, a new way to evaluate AI judges without relying on human-graded answers.