InterPrior is a new 'brain' (a control policy) for simulated humans and humanoid robots, letting them move, balance, and use objects by following simple goals instead of step-by-step instructions.
Big video makers (diffusion models) create great videos but are too slow because they use hundreds of tiny clean-up steps.
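Those 'hundreds of tiny clean-up steps' can be pictured with a toy loop. This is a hand-made illustration, not any real model's code: each pass removes only a small fraction of the remaining noise, so getting a clean result takes many passes, and in a real diffusion model every pass is a full (expensive) network call.

```python
import random

def toy_denoise(num_steps, size=16, seed=0):
    """Toy sketch of iterative denoising: start from random noise and
    repeatedly nudge each value a little toward a pretend 'clean image'.
    Illustration only -- not an actual diffusion sampler."""
    rng = random.Random(seed)
    target = [1.0] * size                               # pretend clean image
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]      # pure noise
    for _ in range(num_steps):
        # each step removes only ~5% of the remaining error,
        # which is why so many steps are needed
        x = [xi + 0.05 * (ti - xi) for xi, ti in zip(x, target)]
    # mean absolute distance from the clean target: lower = cleaner
    return sum(abs(ti - xi) for xi, ti in zip(x, target)) / size

print(toy_denoise(10))    # still noisy after a few steps
print(toy_denoise(500))   # much cleaner, but 50x the work
```

Speed-up papers like the ones below attack exactly this trade-off: fewer or cheaper passes for (nearly) the same final quality.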
This paper says long chain-of-thought (Long CoT) works best when it follows a 'molecular' pattern with three kinds of thinking bonds: Deep-Reasoning, Self-Reflection, and Self-Exploration.
FlowBlending is a simple way to speed up video diffusion models by smartly choosing when to use a big model and when a small one is enough.
This paper makes diffusion-based video super-resolution (VSR) practical for live, low-latency use by removing the need for future frames and cutting denoising from ~50 steps down to just 4.
Pixels are the raw stuff of images, and this paper shows you can learn great vision skills by predicting pixels directly, not by comparing fancy hidden features.
Seedance 1.5 pro is a single model that generates video and sound together, so lips, music, and actions match naturally.
KlingAvatar 2.0 is a system that makes long, sharp, lifelike talking-person videos that follow audio, images, and text instructions all at once.
Big text-to-image models make amazing pictures but are slow because they take hundreds of tiny steps to turn noise into an image.