Papers (1055)

SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

Guibin Chen, Dixuan Lin et al. · Feb 25 · arXiv

SkyReels-V4 is a single, unified model that makes videos and matching sounds together, while also letting you fix or change parts of a video.

#multimodal diffusion transformer #video-audio generation #inpainting

From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors

Intermediate

Liangbing Zhao, Le Zhuo et al. · Feb 25 · arXiv

The paper turns image editing from a one-step “before → after” trick into a mini physics simulation that follows real-world rules.

#physics-aware image editing #physical state transition #latent transition priors

Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

Intermediate

Euisoo Jung, Byunghyun Kim et al. · Feb 25 · arXiv

Diffusion models make great images but are slow because they remove noise a little at a time over many sequential steps.

#diffusion inference #multi-GPU acceleration #data parallelism

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

Intermediate

Yongtong Wu, Shaoyuan Chen et al. · Feb 25 · arXiv

Agent-style LLMs chat with tools over many short turns, so most tokens are repeats and the system spends more time fetching previously computed state (the KV-Cache) than computing new answers.

#KV-Cache #prefill-decode disaggregation #dual-path loading

ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Intermediate

Xiaoxuan Wang, Han Zhang et al. · Feb 25 · arXiv

This paper tackles why training AI agents that act over many steps (like browsing the web or moving in a house) often becomes unstable and collapses.

#Agentic Reinforcement Learning #Policy Gradient #Sequence-level Clipping

VecGlypher: Unified Vector Glyph Generation with Language Models

Intermediate

Xiaoke Huang, Bhavul Gauri et al. · Feb 25 · arXiv

VecGlypher is a single language-model-based system that writes SVG code to draw crisp, editable letters (glyphs) directly from text descriptions or a few example images.

#VecGlypher #vector glyph generation #SVG path tokens

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Intermediate

Yuanda Xu, Hejian Sang et al. · Feb 24 · arXiv

The paper shows that when training reasoning AIs with reinforcement learning, treating every wrong answer the same makes the AI overconfident in some bad paths and less diverse overall.

#ACE #Reinforcement Learning with Verifiable Rewards #GRPO

Test-Time Training with KV Binding Is Secretly Linear Attention

Intermediate

Junchen Liu, Sven Elflein et al. · Feb 24 · arXiv

The paper shows that Test-Time Training (TTT) with key–value (KV) binding is not really memorizing like a notebook; it is acting like a learned linear attention layer.

#Test-Time Training #KV Binding #Linear Attention

On Data Engineering for Scaling LLM Terminal Capabilities

Intermediate

Renjie Pi, Grace Lam et al. · Feb 24 · arXiv

This paper shows that you can vastly improve a model’s command-line (terminal) skills by carefully engineering the training data, not just by using a bigger model.

#Terminal-Bench 2.0 #terminal agents #synthetic task generation

See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Intermediate

Jaehyun Park, Minyoung Ahn et al. · Feb 24 · arXiv

Modern image generators can still make strange mistakes like extra fingers or melted faces, and today’s vision-language models (VLMs) often miss them.

#visual artifacts #structural artifacts #diffusion transformer

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

Intermediate

Jihao Qiu, Lingxi Xie et al. · Feb 24 · arXiv

LongVideo-R1 is a smart video-watching agent that jumps to the right moments in long videos instead of scanning everything.

#long video understanding #video navigation #multimodal large language model

PyVision-RL: Forging Open Agentic Vision Models via RL

Intermediate

Shitian Zhao, Shaoheng Lin et al. · Feb 24 · arXiv

PyVision-RL teaches vision-language models to act like curious agents that think in multiple steps and use Python tools to inspect images and videos.

#agentic multimodal models #reinforcement learning #dynamic tooling
