D4RT is a new AI model that turns regular videos into moving 3D scenes (4D) quickly and accurately.
InfiniteVL is a vision-language model that mixes two ideas: local focus with Sliding Window Attention and long-term memory with a linear module called Gated DeltaNet.
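To make that mix concrete, here is a hedged, much-simplified PyTorch sketch of how such a hybrid block might be wired: a banded (sliding-window) attention layer for local focus, plus a gated delta-rule recurrence acting as linear long-term memory. The class names, gating, and sequential update below are illustrative assumptions, not InfiniteVL's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlidingWindowAttention(nn.Module):
    """Local focus: each token attends only to the last `window` tokens."""
    def __init__(self, dim: int, heads: int = 8, window: int = 128):
        super().__init__()
        self.heads, self.window = heads, window
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, T, D)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.heads, D // self.heads)
        q, k, v = (t.view(*shape).transpose(1, 2) for t in (q, k, v))
        idx = torch.arange(T, device=x.device)
        # causal band: key j is visible from query i iff 0 <= i - j < window
        dist = idx[:, None] - idx[None, :]
        mask = (dist >= 0) & (dist < self.window)
        o = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return self.out(o.transpose(1, 2).reshape(B, T, D))

class GatedDeltaMemory(nn.Module):
    """Simplified gated delta-rule recurrence: a fixed-size matrix state S is
    decayed by a gate and corrected toward each new (key, value) pair."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gates = nn.Linear(dim, 2)        # per-token decay alpha, step beta
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, T, D)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        alpha, beta = torch.sigmoid(self.gates(x)).unbind(-1)   # each (B, T)
        S = x.new_zeros(B, D, D)              # linear "long-term memory" state
        outs = []
        for t in range(T):                    # sequential form, for clarity only
            kt, vt, qt = k[:, t], v[:, t], q[:, t]
            S = alpha[:, t, None, None] * S                      # forget a little
            pred = torch.einsum("bij,bj->bi", S, kt)             # current recall
            # delta rule: write only the prediction error back into memory
            S = S + beta[:, t, None, None] * torch.einsum("bi,bj->bij", vt - pred, kt)
            outs.append(torch.einsum("bij,bj->bi", S, qt))
        return self.out(torch.stack(outs, dim=1))

class HybridBlock(nn.Module):
    """One layer of the sketched hybrid: local attention for detail,
    gated delta memory for long-range context, each on a residual path."""
    def __init__(self, dim: int):
        super().__init__()
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.swa, self.mem = SlidingWindowAttention(dim), GatedDeltaMemory(dim)

    def forward(self, x):
        x = x + self.swa(self.n1(x))
        return x + self.mem(self.n2(x))
```

The appeal of this shape is that the attention cost stays bounded by the window size while the recurrent state carries information across arbitrarily long inputs; the exact update rule and chunked training tricks in the real model differ from this toy loop.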
Robots that act on what they see and what they're told (VLA models) can do many tasks, but they often bump into things because their actions come with no safety guarantees.
Wan-Move is a new way to control how things move in AI-generated videos by guiding motion directly inside the model’s hidden features.
This paper teaches a vision-language model to think about images by talking to copies of itself, using only words to plan and decide.
TrackingWorld turns a regular single-camera video into a map of where almost every pixel moves in 3D space over time.
OpenSubject is a giant video-based dataset (2.5M samples, 4.35M images) built to help AI make pictures that keep each person or object looking like themselves, even in busy scenes.
EgoX turns a regular third-person video into a first-person video that looks like it was filmed from the actor’s eyes.
Robots that follow spoken instructions used to be slow and jerky because one big model tried to think and move at the same time.
TreeGRPO teaches image generators using a smart branching tree so each training run produces many useful learning signals instead of just one.
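To show what "many learning signals from one run" could look like, here is a hedged toy sketch: a sampling trajectory is split at a few steps, every leaf image is scored, and the leaf rewards are normalized against each other so each branch contributes its own advantage signal. The denoise_step and reward_fn placeholders and the branching schedule are invented for illustration and are not TreeGRPO's actual algorithm.

```python
import torch

# Hypothetical stand-ins: one stochastic sampler step and a reward scorer.
def denoise_step(latent: torch.Tensor, step: int, noise_scale: float = 0.1) -> torch.Tensor:
    """Placeholder for one stochastic denoising step of the image generator."""
    return latent + noise_scale * torch.randn_like(latent)

def reward_fn(latent: torch.Tensor) -> float:
    """Placeholder reward (e.g. an aesthetic or prompt-alignment scorer)."""
    return float(-latent.pow(2).mean())

def tree_rollout(latent, step, total_steps, branch_at, branch_factor=2):
    """Expand one denoising trajectory into a tree of trajectories.

    At the steps listed in `branch_at`, the partial sample is duplicated and
    continued independently, so a single run ends in many leaves instead of one.
    Each leaf remembers the branch choices on its path.
    """
    if step == total_steps:
        return [{"reward": reward_fn(latent), "path": []}]
    k = branch_factor if step in branch_at else 1
    leaves = []
    for b in range(k):
        child = denoise_step(latent, step)
        for leaf in tree_rollout(child, step + 1, total_steps, branch_at, branch_factor):
            leaf["path"].insert(0, (step, b))
            leaves.append(leaf)
    return leaves

def group_relative_advantages(leaves):
    """GRPO-style signal: normalize leaf rewards to zero mean, unit std.

    A full method would propagate these back through the shared tree prefix;
    this sketch only shows that every leaf yields its own advantage.
    """
    rewards = torch.tensor([leaf["reward"] for leaf in leaves])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

if __name__ == "__main__":
    leaves = tree_rollout(torch.randn(4, 8, 8), step=0, total_steps=6, branch_at={1, 3})
    adv = group_relative_advantages(leaves)
    print(f"{len(leaves)} leaves -> {len(adv)} advantage signals")
```

Because sibling branches share a prefix, the extra signals come at much less than the cost of running that many independent trajectories, which is the efficiency argument behind tree-shaped rollouts.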
The paper shows that many AI image generators are trained to prefer one popular idea of beauty, even when a user clearly asks for something messy, dark, blurry, or emotionally heavy.
Most image-similarity tools only notice how things look (color, shape, class) and miss deeper, human-like connections.
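As a reference point for what "only noticing how things look" means, here is a minimal sketch of the standard appearance-level pipeline such tools use: embed both images with a pretrained vision backbone and compare the features with cosine similarity. The choice of a ResNet-50 backbone here is an illustrative assumption, not what any particular tool ships with.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Appearance-level similarity: pooled CNN features capture color, texture,
# shape, and object class, but not deeper, human-like connections
# (shared function, story, or emotional association between images).
encoder = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder.fc = torch.nn.Identity()      # drop the classifier, keep 2048-d features
encoder.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def appearance_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between pooled backbone features of two images."""
    with torch.no_grad():
        feats = [encoder(preprocess(Image.open(p).convert("RGB")).unsqueeze(0))
                 for p in (path_a, path_b)]
    return torch.nn.functional.cosine_similarity(feats[0], feats[1]).item()
```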