This paper builds a new test, LongShOTBench, to check if AI can truly understand long videos by using sight, speech, and sounds together.
AniX is a system that lets you place any character into any 3D world and control them with plain language, like “run forward” or “play a guitar.”
Alchemist is a smart data picker for training text-to-image models that learns which pictures and captions actually help the model improve.
FlashPortrait makes talking-portrait videos that keep a person’s identity steady for as long as you want—minutes or even hours.
Reward models are like scorekeepers that tell AI which answers people like more, and this paper builds the first big test for scorekeepers that judge both pictures and words together.
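To make the "scorekeeper" idea concrete, here is a tiny, purely illustrative sketch of how a reward model scores two candidate answers about an image and how a benchmark item could check the model against a human preference. None of it comes from the paper: the `score_answer` heuristic, the Bradley-Terry-style comparison, and the example item are all stand-ins for the general pattern.

```python
# Minimal sketch of a reward model as a "scorekeeper" and of grading it on a
# preference item. score_answer and the example data are hypothetical.
import math

def score_answer(image_desc: str, question: str, answer: str) -> float:
    """Stand-in for a multimodal reward model: returns a scalar score.
    A real model would look at the actual image; this toy heuristic just
    counts word overlap so the example runs."""
    relevant = set(image_desc.lower().split()) | set(question.lower().split())
    return float(sum(1 for w in answer.lower().split() if w in relevant))

def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry style: probability that answer A beats answer B."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

# One benchmark item: two candidate answers plus the human-preferred one.
item = {
    "image_desc": "a dog catching a red frisbee in a park",
    "question": "What is the dog doing?",
    "answer_a": "The dog is catching a red frisbee in the park.",
    "answer_b": "The dog is sleeping on a couch.",
    "human_preferred": "a",
}

s_a = score_answer(item["image_desc"], item["question"], item["answer_a"])
s_b = score_answer(item["image_desc"], item["question"], item["answer_b"])
model_pick = "a" if s_a > s_b else "b"
print(f"scores: a={s_a:.1f}, b={s_b:.1f}, P(a beats b)={preference_probability(s_a, s_b):.2f}")
print("reward model agrees with humans:", model_pick == item["human_preferred"])
```

A real benchmark would average this agree/disagree check over many such items to get an accuracy score for the scorekeeper.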
RePlan is a plan-then-execute system that first figures out exactly where to edit in a picture and then makes clean changes there.
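The plan-then-execute split is easy to picture in code. Below is a minimal sketch of the general idea, assuming a planner that outputs an edit mask and an executor that only touches masked pixels; the `plan_edit` and `execute_edit` functions are illustrative stand-ins, not RePlan's actual interface.

```python
# Sketch of plan-then-execute image editing: first decide WHERE to edit,
# then change only those pixels. Both functions are hypothetical.
import numpy as np

def plan_edit(image: np.ndarray, instruction: str) -> np.ndarray:
    """Stage 1 (planner): return a boolean mask marking the edit region.
    A real planner would ground the instruction in the image; here we
    simply pick the top-left quadrant as a stand-in."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    mask[: h // 2, : w // 2] = True
    return mask

def execute_edit(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stage 2 (executor): apply the change only inside the planned region,
    leaving everything outside it untouched."""
    edited = image.copy()
    edited[mask] = [255, 0, 0]  # stand-in edit: paint the region red
    return edited

image = np.zeros((64, 64, 3), dtype=np.uint8)  # dummy 64x64 RGB image
mask = plan_edit(image, "make the top-left area red")
result = execute_edit(image, mask)
print("edited pixels:", int(mask.sum()), "untouched pixels:", int((~mask).sum()))
```

The point of the split is that the executor never has to guess where the edit belongs, so the rest of the picture stays clean.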
This paper introduces LAMER, a Meta-RL training framework that teaches language agents to explore first and then use what they learned to solve tasks faster.
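Here is a toy sketch of the explore-then-exploit pattern that this kind of training encourages: spend a few steps gathering information about the task, then commit to the best-looking strategy. The bandit setup below is only an illustration, not LAMER's actual environment or training loop.

```python
# Toy explore-then-exploit episode on a multi-armed bandit (illustrative only).
import random

def run_episode(true_rewards, explore_steps=6, exploit_steps=10):
    """Phase 1: try each arm a few times to estimate its payoff.
       Phase 2: pull only the arm that looked best during exploration."""
    counts = [0] * len(true_rewards)
    totals = [0.0] * len(true_rewards)

    # Exploration phase: round-robin over arms, noisy observations.
    for t in range(explore_steps):
        arm = t % len(true_rewards)
        counts[arm] += 1
        totals[arm] += true_rewards[arm] + random.gauss(0, 0.1)

    # Exploitation phase: commit to the arm with the best estimate.
    estimates = [totals[i] / max(counts[i], 1) for i in range(len(true_rewards))]
    best_arm = max(range(len(true_rewards)), key=lambda i: estimates[i])
    exploit_return = sum(true_rewards[best_arm] + random.gauss(0, 0.1)
                         for _ in range(exploit_steps))
    return best_arm, exploit_return

random.seed(0)
best_arm, ret = run_episode(true_rewards=[0.2, 0.9, 0.5])
print(f"agent committed to arm {best_arm}, exploit-phase return = {ret:.2f}")
```

An agent that skips the exploration phase has to guess, which is exactly the failure mode this kind of training is meant to fix.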
Robots learn best from the first-person (egocentric) view they actually see through their own cameras, but most AI models are trained on third-person videos and get confused.
Kling-Omni is a single, unified model that can understand text, images, and videos together and then make or edit high-quality videos from those mixed instructions.
Traditional self-driving used separate boxes for seeing, thinking, and acting, but tiny mistakes in early boxes could snowball into big problems later.
DataFlow is a building-block system that helps large language models get better data by unifying how we create, clean, check, and organize that data.
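The "building block" idea is easiest to see as small steps that share one interface and can be chained into a pipeline. The sketch below is a generic illustration of that pattern, assuming toy `clean`, `check`, and `organize` steps and a simple record format; it is not DataFlow's actual API.

```python
# Sketch of composable data-preparation steps chained into one pipeline.
# Step names, the record format, and the quality rules are all illustrative.
from typing import Callable, Iterable

Record = dict  # one training example, e.g. {"text": ...}
Step = Callable[[Iterable[Record]], Iterable[Record]]

def clean(records):
    """Normalize whitespace and drop empty texts."""
    for r in records:
        text = " ".join(r.get("text", "").split())
        if text:
            yield {**r, "text": text}

def check(records):
    """Keep only records that pass a simple quality check (length here)."""
    for r in records:
        if len(r["text"]) >= 10:
            yield r

def organize(records):
    """Tag each record with a bucket so downstream mixing is easy."""
    for r in records:
        yield {**r, "bucket": "long" if len(r["text"]) > 40 else "short"}

def run_pipeline(records, steps: list[Step]):
    for step in steps:
        records = step(records)
    return list(records)

raw = [{"text": "  The quick   brown fox jumps over the lazy dog.  "},
       {"text": "too short"},
       {"text": ""}]
print(run_pipeline(raw, [clean, check, organize]))
```

Because every step takes records in and hands records out, new creation, cleaning, or checking blocks can be swapped in without rewriting the rest of the pipeline.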
JustRL shows that a tiny, steady recipe for reinforcement learning (RL) can make a 1.5B-parameter language model much better at math without fancy tricks.
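A core ingredient in RL for math is a verifiable reward: 1 if the model's final answer matches the reference, 0 otherwise. The sketch below shows that idea in the simplest possible form; the answer-extraction rule and the fake "policy samples" are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of a verifiable, binary correctness reward for math RL.
# extract_final_answer and the sample completions are hypothetical.
import re

def extract_final_answer(completion: str) -> str | None:
    """Take the last number in the completion as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None

def reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer matches the reference, else 0.0."""
    return 1.0 if extract_final_answer(completion) == reference else 0.0

# Pretend the policy sampled several solutions to "What is 17 * 3?"
samples = [
    "17 * 3 = 17 + 17 + 17 = 51",
    "I think the answer is 54",
    "Multiplying gives 51",
]
print([reward(s, reference="51") for s in samples])  # [1.0, 0.0, 1.0]
# An RL update (e.g. a policy gradient step) would then push the model toward
# the higher-reward samples; applied steadily, this simple signal is the
# whole training signal.
```

The takeaway of the paper is that a plain, stable loop around a signal like this can be enough, without piling on extra tricks.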