This paper teaches image generators to place objects in the right spots by building a special teacher, a reward model that scores spatial relationships.
JavisDiT++ is a new AI that makes short videos and matching sounds from a text prompt, keeping sight and sound in sync.
The paper builds a Computer-Using World Model (CUWM) that lets an AI “imagine” what a desktop app (like Word/Excel/PowerPoint) will look like after a click or keystroke—before doing it for real.
The paper builds StarWM, a “world model” that lets a StarCraft II agent imagine what will happen a few seconds after it takes an action.
DeepGen 1.0 is a small 5B-parameter model that can both make new images and edit existing ones from text instructions.
VidVec shows that video-capable multimodal language models already hide strong matching signals between videos and sentences inside their middle layers.
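To make the idea concrete, here is a minimal sketch of that kind of probe, with dummy tensors standing in for a real model's hidden states; the layer index, pooling choice, and shapes are illustrative assumptions, not VidVec's actual recipe.

```python
import torch
import torch.nn.functional as F

# Stand-in for the per-layer hidden states a multimodal LM returns when
# called with output_hidden_states=True: one (batch, seq_len, dim) tensor
# per layer. A real model would produce these from video frames or from
# the caption's tokens; here they are random so the demo runs anywhere.
def fake_hidden_states(num_layers=24, seq_len=64, dim=512):
    return tuple(torch.randn(1, seq_len, dim) for _ in range(num_layers + 1))

def pool_layer(hidden_states, layer_idx):
    """Mean-pool one layer's token states into a single embedding vector."""
    return hidden_states[layer_idx].mean(dim=1).squeeze(0)

video_states = fake_hidden_states()
caption_states = fake_hidden_states()

# Probe a middle layer (12 of 24 here, purely illustrative) rather than the
# final one: the claim is that video-text alignment peaks partway through.
v = pool_layer(video_states, 12)
t = pool_layer(caption_states, 12)

score = F.cosine_similarity(v, t, dim=0)
print(f"video-caption matching score: {score.item():+.3f}")
```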
HY3D-Bench is a complete, open-source “starter kit” for making and studying high-quality 3D objects.
DIFFA-2 is a new audio AI that listens to speech, sounds, and music and answers questions about them using a diffusion-style language model instead of the usual step-by-step (autoregressive) method.
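To see the difference between the two decoding styles, here is a toy sketch (nothing like DIFFA-2's real architecture): the autoregressive loop runs one forward pass per token, while the diffusion-style loop starts from an all-masked answer and fills it in over a few parallel refinement steps. The vocabulary, mask id, and random "model" are all made up for illustration.

```python
import torch

VOCAB, MASK = 100, 0  # toy vocabulary; token id 0 plays the role of [MASK]

def toy_logits(tokens):
    """Stand-in for a real network: random logits for each position."""
    return torch.randn(len(tokens), VOCAB)

def autoregressive_decode(length):
    # The usual step-by-step method: one forward pass per generated token.
    out = []
    for _ in range(length):
        logits = toy_logits(out + [MASK])             # condition on the prefix
        out.append(int(logits[-1, 1:].argmax()) + 1)  # never emit the mask id
    return out

def diffusion_decode(length, steps=4):
    # Diffusion-style: start fully masked, refine all positions in parallel,
    # and commit only the most confident predictions at each step.
    tokens = [MASK] * length
    per_step = max(1, length // steps)
    for _ in range(steps):
        logits = toy_logits(tokens)
        conf, pred = logits[:, 1:].max(dim=-1)        # skip the mask id
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        for i in sorted(masked, key=lambda i: -conf[i].item())[:per_step]:
            tokens[i] = int(pred[i]) + 1
    return tokens

print(autoregressive_decode(8))  # 8 sequential forward passes
print(diffusion_decode(8))       # ~4 parallel refinement passes
```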
Large reasoning models got very good at thinking step-by-step, but that sometimes made them too eager to follow harmful instructions.
DreamActor-M2 is a new way to make a still picture move by copying motion from a video while keeping the character’s look the same.
This paper shows that generating short videos can help an AI plan and reason in pictures, and that this works better than writing out reasoning steps in text.
Videos are made of very long sequences of tokens, and regular attention compares every pair of tokens, so its cost grows with the square of the sequence length, which quickly becomes slow and expensive.
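A quick back-of-the-envelope sketch of why this matters for video: the attention score matrix has one entry per token pair, so doubling the token count quadruples the compute and memory. The frame and token counts below are hypothetical, just to show the scaling.

```python
import torch

def attention_cost(seq_len, dim=64):
    q = torch.randn(seq_len, dim)
    k = torch.randn(seq_len, dim)
    scores = q @ k.T                     # (seq_len, seq_len) score matrix
    flops = 2 * seq_len * seq_len * dim  # multiply-adds for QK^T alone
    return scores.shape, flops

# Hypothetical clip sizes at 64 tokens per frame: doubling the frame count
# doubles the tokens but quadruples the score matrix and the compute.
for frames in (16, 32):
    seq_len = frames * 64
    shape, flops = attention_cost(seq_len)
    print(f"{frames:2d} frames -> scores {tuple(shape)}, ~{flops:.1e} FLOPs")
```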