This paper fixes a hidden flaw in a popular image tokenizer (FSQ) with a simple one-line change to its activation function.
Videos are made of very long lists of tokens, and regular attention looks at every pair of tokens, which is slow and expensive.
Diffusion models make pictures from noise but often miss what people actually want in the prompt or what looks good to humans.
Alterbute is a diffusion-based method that changes an object's intrinsic attributes (color, texture, material, shape) in a photo while keeping the object's identity and the scene intact.
DiffProxy turns tricky multi-camera photos of a person into a clean 3D body and hands by first painting a precise 'map' on each pixel and then fitting a standard body model to that map.
The paper turns one flat picture into a neat stack of see-through layers, so you can edit one thing without messing up the rest.
This paper helps the words you type into a generator turn into the right pictures and videos more reliably.
Big text-to-image models make amazing pictures but are slow because they take hundreds of tiny steps to turn noise into an image.
StereoSpace turns a single photo into a full 3D-style stereo pair without ever estimating a depth map.
Diffusion models sometimes copy training images too closely, which can be a privacy and copyright problem.
TreeGRPO teaches image generators using a smart branching tree so each training run produces many useful learning signals instead of just one.