DREAM is one model that both understands images (like CLIP) and makes images from text (like top text-to-image models).
SenCache speeds up video diffusion models by reusing past outputs only when the model's output is predicted to change very little.
The paper turns image editing from a one-step “before → after” trick into a mini physics simulation that follows real-world rules.
Big idea: Make image-making AIs stop, think, check, and fix their own work so they get better at both creating pictures and understanding them.
This paper fixes a hidden flaw in a popular image tokenizer (FSQ) with a simple one-line change to its activation function.
Videos are made of very long lists of tokens, and regular attention looks at every pair of tokens, which is slow and expensive.
Diffusion models make pictures from noise but often miss what people actually want in the prompt or what looks good to humans.
Alterbute is a diffusion-based method that changes an object's intrinsic attributes (color, texture, material, shape) in a photo while keeping the object's identity and the scene intact.
DiffProxy turns tricky multi-camera photos of a person into a clean 3D body and hands by first painting a precise 'map' on each pixel and then fitting a standard body model to that map.
The paper turns one flat picture into a neat stack of see-through layers, so you can edit one thing without messing up the rest.
This paper is about making the words you type into a generator turn into the right pictures and videos more reliably.
Big text-to-image models make amazing pictures but are slow because they take hundreds of tiny steps to turn noise into an image.
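To make the cost in the video-attention summary above concrete, here is a small sketch that counts how many token pairs regular (full) attention has to score. The token counts are hypothetical, picked only for illustration, and the helper name attention_pairs is made up for this example; it does not come from any of the papers.

```python
def attention_pairs(num_tokens: int) -> int:
    """Count the (query, key) pairs full self-attention scores for one head."""
    # Every token attends to every token, so the count grows quadratically.
    return num_tokens * num_tokens


# Hypothetical sizes (assumptions for illustration, not from the source).
image_tokens = 1_024           # e.g. one frame split into a 32 x 32 grid
video_tokens = 16 * 1_024      # 16 frames at the same grid size

image_pairs = attention_pairs(image_tokens)
video_pairs = attention_pairs(video_tokens)

print(image_pairs)                  # 1048576
print(video_pairs)                  # 268435456
print(video_pairs // image_pairs)   # 256
```

Under these assumptions, 16 times more tokens means 256 times more pairs to score, which is why long video token lists make regular attention slow and expensive.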