The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often produce worse images when used for generation.
Steer3D lets you change a 3D object just by typing what you want, like “add a roof rack,” and it applies the edit in one quick pass.
Digital humans used to just copy motions; this paper makes them think, speak, and move in sync like real people.
RoboTracer is a vision-language model that turns tricky, word-only instructions into safe, step-by-step 3D paths (spatial traces) robots can follow.
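To make “spatial trace” concrete, here is a tiny, hypothetical Python sketch: an ordered list of 3D waypoints the robot visits in sequence. The field names, units, and gripper flag are all assumptions for illustration, not RoboTracer’s actual output format.

```python
# Hypothetical sketch of a "spatial trace": an ordered list of 3D
# waypoints (plus a gripper state) that a robot arm executes step by
# step. All field names and conventions here are assumed.

from dataclasses import dataclass

@dataclass
class Waypoint:
    x: float
    y: float
    z: float                    # meters, robot base frame (assumed convention)
    gripper_open: bool = True

# "put the cup on the shelf" -> a trace the robot follows point by point
trace = [
    Waypoint(0.40, 0.10, 0.05, gripper_open=True),    # approach the cup
    Waypoint(0.40, 0.10, 0.02, gripper_open=False),   # grasp
    Waypoint(0.55, -0.20, 0.30, gripper_open=False),  # carry toward shelf
    Waypoint(0.55, -0.20, 0.28, gripper_open=True),   # release
]
```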
The paper introduces Nemotron-Cascade, a step-by-step (cascaded) reinforcement learning recipe that trains an AI on one domain at a time: alignment, instruction following, math, coding, and software engineering.
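To make “one domain at a time” concrete, here is a minimal sketch of a cascaded training loop. The stage list follows the summary above, but run_rl_stage and its internals are hypothetical placeholders, not Nemotron-Cascade’s actual recipe.

```python
# Minimal sketch of cascaded (stage-by-stage) RL training.
# run_rl_stage is a hypothetical placeholder, not the paper's API.

STAGES = ["alignment", "instruction_following", "math", "coding", "software_engineering"]

def run_rl_stage(model, stage, steps=1000):
    """Placeholder: run RL updates on one domain's prompts and rewards."""
    for _ in range(steps):
        pass  # sample prompts from `stage`, score with that domain's reward, update model
    return model

def train_cascade(model):
    # Each stage starts from the checkpoint the previous stage produced,
    # rather than mixing all domains into one RL run.
    for stage in STAGES:
        model = run_rl_stage(model, stage)
    return model
```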
LongVie 2 is a video world model that can generate controllable videos for 3–5 minutes while keeping the look and motion steady over time.
ReFusion is a new way for AI to write text faster by planning in chunks (called slots) and then filling each chunk carefully.
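A toy sketch of the slot-then-fill idea, assuming a two-stage decoder: first plan coarse chunks, then expand each one (which is where the speedup can come from, since slots can be filled in parallel). plan_slots and fill_slot are hypothetical stand-ins, not ReFusion’s real components.

```python
# Toy sketch of slot-then-fill decoding: plan chunks, then complete each.
# plan_slots and fill_slot are hypothetical stand-ins for learned models.

def plan_slots(prompt: str, n_slots: int = 3) -> list[str]:
    """Placeholder planner: produce a short description of each chunk."""
    return [f"slot {i} for: {prompt}" for i in range(n_slots)]

def fill_slot(plan: str) -> str:
    """Placeholder filler: expand one slot plan into final text."""
    return plan.upper()

def generate(prompt: str) -> str:
    slots = plan_slots(prompt)               # 1) coarse plan over chunks
    chunks = [fill_slot(s) for s in slots]   # 2) fill each chunk (parallelizable)
    return " ".join(chunks)

print(generate("write a haiku"))
```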
This survey explains how AI agents remember things and organizes the whole topic into three clear parts: forms, functions, and dynamics.
Janus splits a Mixture-of-Experts (MoE) model into two parts, attention and experts, so each can run on just the right number of GPUs.
Seedance 1.5 pro is a single model that generates video and sound together, so lips, music, and actions match naturally.
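A back-of-envelope illustration of why that split helps: attention workers tend to be bounded by KV-cache memory (which grows with concurrent sequences), while expert workers are bounded by expert weights (which grow with expert count), so each pool can be sized from its own bottleneck. All constants below are made-up example numbers, not Janus’s.

```python
# Illustrative sizing for disaggregated MoE serving: each pool scales
# with a different variable, so a single shared GPU layout would
# over-provision one side. All numbers are invented for illustration.

import math

GPU_MEM_GB = 80

def attention_gpus(concurrent_seqs, kv_gb_per_seq=1.5, attn_weights_gb=20):
    # Attention pool: dominated by KV cache, which grows with traffic.
    kv_total = concurrent_seqs * kv_gb_per_seq
    return math.ceil((kv_total + attn_weights_gb) / GPU_MEM_GB)

def expert_gpus(n_experts, gb_per_expert=5):
    # Expert pool: dominated by expert weights, fixed by model size.
    return math.ceil(n_experts * gb_per_expert / GPU_MEM_GB)

print(attention_gpus(concurrent_seqs=500))  # -> 10 GPUs for attention
print(expert_gpus(n_experts=64))            # -> 4 GPUs for experts
```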
Different programming languages scale differently when training code AI models, so treating them all the same wastes compute and lowers performance.
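One way to see “scale differently” is to fit a separate power law L(N) = a · N^(−b) per language and compare the exponents: a flatter exponent means extra training tokens buy less improvement for that language. The snippet below does this fit on synthetic numbers, not the paper’s data.

```python
# Toy illustration: fit a per-language power law  L(N) = a * N**(-b)
# in log-log space. Loss values are synthetic, not from the paper.

import numpy as np

tokens = np.array([1e8, 1e9, 1e10, 1e11])
loss = {
    "python": np.array([2.10, 1.75, 1.46, 1.22]),   # synthetic
    "haskell": np.array([2.40, 2.20, 2.02, 1.85]),  # synthetic: flatter curve
}

for lang, y in loss.items():
    # log L = log a - b * log N  ->  a straight line in log-log space
    slope, intercept = np.polyfit(np.log(tokens), np.log(y), 1)
    print(f"{lang}: scaling exponent b = {-slope:.3f}")  # larger b => scales better
```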
RecTok is a new visual tokenizer that teaches the diffusion model’s whole forward flow, not just the starting latent features, to carry image meaning.
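One plausible reading of that idea (an assumption on my part, not the paper’s stated loss): supervise the noised latents at every timestep along the flow so they stay predictive of features from a frozen semantic encoder, instead of only aligning the clean latents. The sketch below, with hypothetical names and toy shapes, shows what such a loss could look like.

```python
# Sketch (assumed reading, not RecTok's actual loss): keep noised
# latents at *every* timestep predictive of frozen semantic features,
# rather than supervising only the clean latents at t = 0.

import torch
import torch.nn.functional as F

def flow_semantic_loss(latents, sem_feats, probe, n_steps=4):
    """latents: (B, D) clean latents; sem_feats: (B, D) frozen semantic features;
    probe: small head mapping a noised latent back to semantic space."""
    loss = 0.0
    for t in torch.linspace(0.1, 0.9, n_steps):
        noise = torch.randn_like(latents)
        x_t = (1 - t) * latents + t * noise            # linear, rectified-flow-style path
        loss = loss + F.mse_loss(probe(x_t), sem_feats)
    return loss / n_steps

# usage with toy shapes
B, D = 8, 64
probe = torch.nn.Linear(D, D)
loss = flow_semantic_loss(torch.randn(B, D), torch.randn(B, D), probe)
loss.backward()
```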