The paper introduces UnifiedReward-Flex, a reward model that judges images and videos the way a thoughtful human would—by flexibly changing what it checks based on the prompt and the visual evidence.
Before this work, computer-using AIs mostly imitated old examples and struggled with long, step-by-step tasks on real computers.
VIBE is a tiny but mighty image editor that listens to your words and changes pictures while keeping the original photo intact unless you ask otherwise.
This paper builds a real-time talking-listening head avatar that reacts naturally to your words, tone, nods, and smiles in about half a second.
This paper teaches text-to-video models to follow real-world physics, so people, balls, water, glass, and fire act the way they should.
Before this work, most big language models generated text one token at a time (autoregressive), which made generation slow and hard to parallelize.