This paper checks whether Nano Banana Pro, a popular text-to-image model, can restore messy, degraded photos without any extra training.
This paper teaches vision-language models to reason about pictures using puzzles instead of expensive human labels.
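The summary doesn't name the exact puzzles used, but the classic trick is the same across variants: build a puzzle whose answer is known by construction, so correctness can be checked for free. A minimal jigsaw-style sketch (illustrative, not necessarily this paper's task):

```python
import numpy as np

def make_jigsaw(image: np.ndarray, grid: int = 2):
    """Shuffle image tiles and ask the model to recover the permutation.
    The 'answer' is known by construction, so no human annotation is needed."""
    h, w = image.shape[0] // grid, image.shape[1] // grid
    tiles = [image[r*h:(r+1)*h, c*w:(c+1)*w]
             for r in range(grid) for c in range(grid)]
    order = np.random.permutation(len(tiles))
    puzzle = [tiles[i] for i in order]
    return puzzle, order            # order = the free, verifiable label

img = np.arange(16 * 16 * 3).reshape(16, 16, 3)
puzzle, label = make_jigsaw(img)    # a reward checker can verify any answer
```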
MemFlow is a new way for AI to remember the right parts of a long video story while it keeps making new parts, so characters and scenes stay consistent.
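MemFlow's exact retrieval rule isn't spelled out in this summary, but the core idea can be sketched: before generating the next chunk, look up the stored scene memories most similar to what is being made right now. A minimal sketch (all names here are illustrative, not MemFlow's API):

```python
import numpy as np

def retrieve_memory(memory_bank: np.ndarray, query: np.ndarray,
                    top_k: int = 4) -> np.ndarray:
    """Return the top_k memory entries most similar to the current query.

    memory_bank: (N, D) embeddings of previously generated video chunks.
    query:       (D,)  embedding summarizing the chunk being generated.
    """
    # Cosine similarity between the query and every stored memory entry.
    mem = memory_bank / np.linalg.norm(memory_bank, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = mem @ q
    # Keep only the most relevant entries; the rest are ignored this step.
    idx = np.argsort(scores)[-top_k:][::-1]
    return memory_bank[idx]

# Usage: condition the next generation step on the retrieved context.
bank = np.random.randn(100, 64)           # 100 remembered chunks, 64-d each
current = np.random.randn(64)             # summary of the chunk in progress
context = retrieve_memory(bank, current)  # (4, 64) slice fed to the generator
```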
TimeLens studies how to teach AI not just what happens in a video, but exactly when it happens, a task called video temporal grounding (VTG).
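To make "exactly when" concrete, here's a toy grounder: given per-second relevance scores between the query and the video (as a video-text encoder might produce), it returns the time span that best covers the queried moment. This is a generic sketch, not TimeLens's method, and the threshold `tau` is an illustrative choice:

```python
import numpy as np

def ground_moment(scores: np.ndarray, tau: float = 0.5) -> tuple[int, int]:
    """Return the (start, end) seconds whose relevance beats the threshold
    tau by the largest total margin, so the span stretches to cover the
    whole event but stops at the background."""
    gains = scores - tau
    best, span = -np.inf, (0, 1)
    for start in range(len(scores)):
        for end in range(start + 1, len(scores) + 1):
            total = gains[start:end].sum()
            if total > best:
                best, span = total, (start, end)
    return span

# Scores spike where the queried event ("the dog jumps") happens.
scores = np.array([0.1, 0.2, 0.1, 0.8, 0.95, 0.9, 0.2, 0.1])
print(ground_moment(scores))  # (3, 6): the event spans seconds 3-6
```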
This paper shows a simple, math-guided way to turn image pieces into tidy symbols (tokens) using points spread evenly on a sphere.
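The summary doesn't give the paper's exact construction, but "points spread evenly on a sphere" has a classic instance: the Fibonacci lattice. A toy sketch in 3-d (real patch embeddings live in far higher dimensions and would need a different even-spreading scheme):

```python
import numpy as np

def fibonacci_sphere(n: int) -> np.ndarray:
    """Spread n points nearly evenly on the unit sphere via the Fibonacci
    lattice; these fixed points act as the token codebook."""
    i = np.arange(n)
    golden = np.pi * (3.0 - np.sqrt(5.0))   # golden angle in radians
    y = 1.0 - 2.0 * (i + 0.5) / n           # evenly spaced heights
    r = np.sqrt(1.0 - y * y)                # circle radius at each height
    return np.stack([r * np.cos(golden * i), y, r * np.sin(golden * i)], axis=1)

def tokenize(patches: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Assign each patch feature the id of its nearest sphere point
    (max cosine similarity after normalizing onto the sphere)."""
    feats = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    return np.argmax(feats @ codebook.T, axis=1)

codebook = fibonacci_sphere(1024)     # 1024 evenly spread "symbols"
patches = np.random.randn(196, 3)     # toy 3-d features, one per image patch
tokens = tokenize(patches, codebook)  # one discrete token id per patch
```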
CRISP turns a normal phone video of a person into a clean 3D world and a virtual human that can move in it without breaking physics.
MMGR is a new benchmark that checks whether AI image and video generators follow real-world rules, not just whether their outputs look pretty.
Autoregressive (AR) models normally write one token at a time, which is accurate but slow for long answers.
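Here's the standard greedy AR loop that creates the bottleneck: every output token costs one full, sequential model call (`model` is a placeholder for any next-token predictor):

```python
def generate(model, prompt_ids: list[int], max_new: int, eos_id: int) -> list[int]:
    """Standard autoregressive decoding: each output token needs one full
    forward pass over everything generated so far, so an answer of length T
    costs T sequential model calls."""
    ids = list(prompt_ids)
    for _ in range(max_new):
        logits = model(ids)                              # one sequential call
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy
        ids.append(next_id)
        if next_id == eos_id:                            # stop at end marker
            break
    return ids

# Toy usage: a stub model over a 3-token vocab that emits token 2 until it
# decides to stop with the end-of-sequence token (id 0).
stub = lambda ids: [0.0, 0.1, 0.9] if len(ids) < 6 else [1.0, 0.0, 0.0]
print(generate(stub, [1, 2], max_new=8, eos_id=0))  # [1, 2, 2, 2, 2, 2, 0]
```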
Robots usually learn by copying many demonstrations, which is expensive and makes them brittle when things change.
Large language models get smarter when they get bigger, but storing all those extra weights eats tons of memory.
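The cost is easy to quantify: weight memory is just parameter count times bytes per weight, before activations or the KV cache are even counted. A quick back-of-the-envelope (model sizes and precisions here are illustrative, not this paper's setup):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory just for the weights, ignoring activations and the KV cache."""
    return n_params * bits_per_weight / 8 / 1e9

for n, name in [(7e9, "7B"), (70e9, "70B")]:
    print(f"{name}: {weight_memory_gb(n, 16):.1f} GB at fp16 "
          f"vs {weight_memory_gb(n, 4):.1f} GB at 4-bit")
# 7B: 14.0 GB at fp16 vs 3.5 GB at 4-bit
# 70B: 140.0 GB at fp16 vs 35.0 GB at 4-bit
```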
RecGPT-V2 turns a recommender system into a smart team: a planner, several specialists, and a fair judge that all work together.
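The division of labor can be pictured as a small orchestration loop. This is generic multi-agent plumbing with made-up names, not RecGPT-V2's actual interfaces:

```python
from typing import Callable

def recommend(user_query: str,
              planner: Callable[[str], list[str]],
              specialists: dict[str, Callable[[str], list[str]]],
              judge: Callable[[list[str]], list[str]]) -> list[str]:
    """Planner splits the request into subtasks, each specialist proposes
    candidates for its subtask, and the judge merges and ranks them fairly."""
    subtasks = planner(user_query)
    candidates: list[str] = []
    for task in subtasks:
        worker = specialists.get(task, specialists["default"])
        candidates.extend(worker(user_query))
    return judge(candidates)

# Toy wiring: two specialists and a judge that dedupes and sorts.
plan = lambda q: ["movies", "music"]
workers = {
    "movies": lambda q: ["Dune", "Arrival"],
    "music": lambda q: ["Holst: The Planets"],
    "default": lambda q: [],
}
rank = lambda items: sorted(set(items))
print(recommend("space things for tonight", plan, workers, rank))
```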
This paper builds A4-Agent, a smart three-part helper that figures out where to touch or use an object just from a picture and a written instruction, without any extra training.