ThinkRL-Edit teaches an image editor to think first and draw second, which makes tricky, reasoning-heavy edits much more accurate.
WebGym is a giant practice world (almost 300,000 tasks) that lets AI web agents learn on real, ever-changing websites instead of tiny, fake ones.
ProGuard is a safety guard for text and images that doesn't just spot known problems: it can also recognize and name new, never-before-seen risks.
The paper teaches vision-language models (AIs that look and read) to pay attention to the right parts of a picture without needing extra tools when they answer.
The paper shows that when vision-language models write captions, only a small set of uncertain words (about 20%) act like forks that steer the whole sentence.
The paper builds YearGuessr, a giant, worldwide photo-and-text dataset of 55,546 buildings with their construction years (1001–2024), GPS coordinates, and popularity (page views).
LongVideoAgent is a team of three AIs that work together to answer questions about hour‑long TV episodes without missing small details.
The paper tackles a big blind spot in vision-language models: understanding how objects move and relate in 3D over time (dynamic spatial reasoning, or DSR).
Big vision-language models are very capable, but they are too large to fit on phones and other small devices.
Reasoning Palette gives a language or vision-language model a tiny hidden “mood” (a latent code) before it starts answering, so it chooses a smarter plan rather than just rolling dice on each next word.
This paper teaches a vision-language model to first find objects in real 3D space (not just 2D pictures) and then reason about where things are.
This paper teaches vision-language models to reason about pictures using puzzles instead of expensive human labels.