SkyReels-V4 is a single, unified model that generates videos and matching audio together, while also letting you edit or repair parts of a video.
The paper turns image editing from a one-step “before → after” trick into a mini physics simulation that follows real-world rules.
Diffusion models make great images but are slow because they remove noise step by step over many passes.
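That step-by-step loop is the bottleneck. A toy sketch (with a made-up stand-in for the learned denoiser, not any real sampler) shows the shape of the cost: one full pass over the sample per step, repeated many times.

```python
import random

def denoise_step(x, t):
    # Stand-in for a learned denoiser: nudge every value toward zero.
    # A real model would predict and subtract the noise instead.
    return [v * (1 - 1.0 / t) for v in x]

# Start from pure noise and refine over many steps -- this repeated
# loop is exactly why diffusion sampling is slow.
x = [random.gauss(0, 1) for _ in range(4)]
for t in range(50, 0, -1):
    x = denoise_step(x, t)
```

Fifty calls to the (in practice, large) model for one sample; fast samplers and distillation exist precisely to shrink this loop.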
Agent-style LLMs chat with tools over many short turns, so most tokens are repeated context and the system spends more time re-reading its cached memory (the KV cache) than computing new answers.
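A toy tally (with made-up token counts, purely for illustration) shows why repeated context dominates a multi-turn agent trace: every turn drags the whole history along, while the new message stays short.

```python
# Each turn carries the full conversation so far plus a short new message.
# Tally how much of the total work is old context vs. new tokens.
history_tokens = 0
new_tokens_total = 0
repeated_tokens = 0
for turn in range(20):
    new = 30                           # short new tool call / observation
    repeated_tokens += history_tokens  # old prefix re-read from the KV cache
    new_tokens_total += new
    history_tokens += new

total = repeated_tokens + new_tokens_total
print(repeated_tokens / total)  # fraction of work spent on old tokens
```

After just 20 turns, over 90% of the processed tokens are repeats, which is why KV-cache handling, not fresh computation, sets the speed limit.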
This paper tackles why training AI agents that act over many steps (like browsing the web or moving in a house) often becomes unstable and collapses.
VecGlypher is a single language-model-based system that writes SVG code to draw crisp, editable letters (glyphs) directly from text descriptions or a few example images.
The paper shows that when training reasoning AIs with reinforcement learning, treating every wrong answer the same makes the AI overconfident in some bad paths and less diverse overall.
The paper shows that Test-Time Training (TTT) with key–value (KV) binding is not really memorizing like a notebook; it is acting like a learned linear attention layer.
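That equivalence can be checked numerically in a simplified setting: writing each key–value pair into a weight matrix with an outer-product update (a one-gradient-step-from-zero stand-in for a TTT layer, with no nonlinearity or normalization) reads out exactly like unnormalized linear attention.

```python
import random

random.seed(0)
d, n = 4, 5
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
vals = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
q = [random.gauss(0, 1) for _ in range(d)]

# "Fast-weight" memory: bind each (k, v) pair into W via an
# outer-product update, W += v k^T.
W = [[0.0] * d for _ in range(d)]
for k, v in zip(keys, vals):
    for i in range(d):
        for j in range(d):
            W[i][j] += v[i] * k[j]

# Reading the memory with a query q...
ttt_out = [sum(W[i][j] * q[j] for j in range(d)) for i in range(d)]

# ...equals unnormalized linear attention over the same pairs:
# sum_i v_i * (k_i . q).
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
attn_out = [sum(v[i] * dot(k, q) for k, v in zip(keys, vals))
            for i in range(d)]

assert all(abs(a - b) < 1e-9 for a, b in zip(ttt_out, attn_out))
```

So the "notebook" is really a linear map: retrieval blends all stored values by key similarity rather than looking any single entry up verbatim.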
This paper shows that you can vastly improve a model’s command-line (terminal) skills by carefully engineering the training data, not just by using a bigger model.
Modern image generators can still make strange mistakes like extra fingers or melted faces, and today’s vision-language models (VLMs) often miss them.
LongVideo-R1 is a smart video-watching agent that jumps to the right moments in long videos instead of scanning everything.
PyVision-RL teaches vision-language models to act like curious agents that think in multiple steps and use Python tools to inspect images and videos.