This paper teaches a computer to find the same object when seen from two very different cameras, like a body camera (first-person) and a room camera (third-person).
Deep research agents write long reports, but old tests often judge only how smooth they sound and whether they add links, not whether the facts are true today or the logic really holds.
The paper treats the last layer of a Large Language Model (the softmax over tokens) as an Energy-Based Model, which lets us measure a new signal called spilled energy.
GEARS is a new way to improve big ranking systems (like what shows up first in your feed) by letting an AI agent explore options safely, instead of humans tweaking knobs by hand.
SARAH is a real-time system that makes virtual characters move their whole bodies naturally during a conversation while knowing where the user is.
This paper builds a "generated reality" system that lets AI-made videos react to your real head and hand movements in VR.
Decoding (how a language model picks the next word) isn’t a bag of tricks; it’s a clean optimisation problem over probabilities.
The paper proposes HyTRec, a recommender system that reads very long histories fast while still paying sharp attention to the latest clicks and purchases.
This paper studies Vision–Language–Action (VLA) robots under one fair setup to find which design choices truly matter.
EgoPush teaches a small mobile robot to push multiple objects into patterns (like a cross or a line) using only what it sees from its own camera, without any global map.
VidEoMT shows that a single, well‑trained Vision Transformer (ViT) can segment and track objects in videos without extra tracking gadgets.
MolHIT is a new AI that builds molecules as graphs, moving from broad chemical groups to exact atoms step by step.