Papers (6)

#Diffusion Transformer (DiT)

Solaris: Building a Multiplayer Video World Model in Minecraft

Georgy Savva, Oscar Michel et al. · Feb 25 · arXiv

Solaris is a video world model that predicts future video for two Minecraft players at the same time, keeping both players' camera views consistent with each other.

#multiplayer world model #video diffusion transformer #Minecraft dataset

Qwen3-TTS Technical Report

Intermediate

Hangrui Hu, Xinfa Zhu et al. · Jan 22 · arXiv

Qwen3-TTS is a family of text-to-speech models that can speak in 10+ languages, clone a new voice from just 3 seconds of audio, and follow detailed style instructions in real time.

#Qwen3-TTS #text-to-speech #voice cloning

Pretraining Frame Preservation in Autoregressive Video Memory Compression

Intermediate

Lvmin Zhang, Shengqu Cai et al. · Dec 29 · arXiv

The paper teaches a video model to squeeze long video history into a tiny memory while still keeping sharp details in single frames.

#autoregressive video generation #video memory compression #frame retrieval pretraining

RecTok: Reconstruction Distillation along Rectified Flow

Intermediate

Qingyu Shi, Size Wu et al. · Dec 15 · arXiv

RecTok is a new visual tokenizer that makes the entire forward flow of a rectified-flow diffusion model carry image semantics, not just the starting latent features.

#Rectified Flow #Flow Matching #Visual Tokenizer

Stronger Normalization-Free Transformers

Intermediate

Mingzhi Chen, Taiming Lu et al. · Dec 11 · arXiv

This paper shows that normalization layers can be removed from Transformers, which still train well when those layers are replaced by a simple point-wise function called Derf.

#Normalization-free Transformers #LayerNorm replacement #Point-wise activation

UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Beginner

Jiehui Huang, Yuechen Zhang et al. · Dec 8 · arXiv

UnityVideo is a single, unified model that learns from many kinds of video information at once, including color (RGB), depth, motion (optical flow), body pose, skeletons, and segmentation, to generate more realistic, world-aware videos.

#multimodal video generation #multi-task learning #dynamic noise scheduling
