Diffusion language models write by gradually unmasking hidden words, so deciding which blanks to reveal next is a big deal for both speed and accuracy.
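For intuition, here is a minimal sketch of one common decoding strategy for such models, confidence-based unmasking. This is an illustration rather than this paper's specific policy, and `model` is a hypothetical callable that returns per-position token logits.

```python
# Minimal sketch: start fully masked, and at each step reveal the blanks
# the model is most confident about (an illustration, not the paper's method).
import torch

def diffusion_decode(model, seq_len, mask_id, steps=8):
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)  # all blanks
    per_step = max(1, seq_len // steps)                         # blanks revealed per step
    for _ in range(steps):
        masked = (tokens == mask_id).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break
        logits = model(tokens.unsqueeze(0))[0]                  # (seq_len, vocab_size)
        conf, pred = logits.softmax(-1).max(-1)                 # confidence + best token per slot
        # Reveal the still-masked positions with the highest confidence.
        order = masked[conf[masked].argsort(descending=True)]
        reveal = order[:per_step]
        tokens[reveal] = pred[reveal]
    return tokens
```

Which positions get revealed first, and how many per step, is exactly the knob the paper is concerned with.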
D4RT is a new AI model that turns regular videos into moving 3D scenes (4D) quickly and accurately.
InfiniteVL is a vision-language model that mixes two ideas: local focus with Sliding Window Attention and long-term memory with a linear module called Gated DeltaNet.
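Roughly, the two branches could be combined as in the sketch below. This is a simplified illustration, not InfiniteVL's actual code: the projections, gate values, and the way the two outputs are mixed are all stand-ins.

```python
# Simplified sketch of the two ideas: sliding-window attention for local focus,
# and a gated delta-rule linear memory for long-range context. x is (seq_len, dim).
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window=4):
    # Each position attends only to itself and the previous `window - 1` positions.
    T, d = q.shape
    scores = q @ k.T / d**0.5
    idx = torch.arange(T)
    local = (idx[:, None] - idx[None, :] < window) & (idx[:, None] >= idx[None, :])
    return scores.masked_fill(~local, float("-inf")).softmax(-1) @ v

def gated_delta_memory(q, k, v, alpha, beta):
    # Recurrent d x d memory S: decay by alpha, overwrite along k (delta rule),
    # then read with q. alpha and beta are per-step gates in (0, 1).
    T, d = q.shape
    S = torch.zeros(d, d)
    outs = []
    for t in range(T):
        kt, vt = k[t], v[t]
        S = alpha[t] * (S - beta[t] * (S @ kt).outer(kt)) + beta[t] * vt.outer(kt)
        outs.append(S @ q[t])
    return torch.stack(outs)

# Toy usage: combine the local and long-range branches with a simple sum.
T, d = 16, 8
x = torch.randn(T, d)
q, k, v = x, F.normalize(x, dim=-1), x          # stand-ins for learned projections
alpha = torch.full((T,), 0.9)                   # decay gate
beta = torch.full((T,), 0.5)                    # write-strength gate
out = sliding_window_attention(q, k, v) + gated_delta_memory(q, k, v, alpha, beta)
```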
Robots driven by vision-language-action (VLA) models can do many tasks, but they often bump into things because nothing in the model guarantees they will act safely.
Wan-Move is a new way to control how things move in AI-generated videos by guiding motion directly inside the model’s hidden features.
This paper teaches a vision-language model to think about images by talking to copies of itself, using only words to plan and decide.
Visionary is a web-based platform that lets you view and interact with advanced 3D scenes, right in your browser, with just a click.
TrackingWorld turns a regular single-camera video into a map of where almost every pixel moves in 3D space over time.
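As a rough picture of what such a map could look like as data (an assumed layout for illustration, not TrackingWorld's actual output format):

```python
# Assumed layout: for a video with T frames at H x W resolution, every pixel of
# the first frame gets a 3D position per frame, plus a visibility flag.
import numpy as np

T, H, W = 48, 256, 448
tracks = np.zeros((T, H, W, 3), dtype=np.float32)   # (x, y, z) in world space per frame
visible = np.ones((T, H, W), dtype=bool)             # whether the point is seen in that frame

# Example: the trajectory of the pixel at row 100, column 200 across the whole clip.
trajectory = tracks[:, 100, 200, :]                   # shape (T, 3)
```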
Multi-agent AI teams are not automatically better; their success depends on matching the team’s coordination style to the job’s structure.
OpenSubject is a giant video-based dataset (2.5M samples, 4.35M images) built to help AI make pictures that keep each person or object looking like itself, even in busy scenes.
EgoX turns a regular third-person video into a first-person video that looks like it was filmed from the actor’s eyes.
Robots that follow spoken instructions used to be slow and jerky because one big model tried to think and move at the same time.