Papers (19)

#Flow Matching

CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance

Hanyang Wang, Yiyang Liu et al. · Mar 3 · arXiv

This paper recasts a popular image-guidance trick (Classifier-Free Guidance) as a feedback-control problem, much like keeping a car steady in its lane.

#Classifier-Free Guidance #Sliding Mode Control #Diffusion Models

GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

Intermediate

GigaBrain Team, Boyuan Wang et al. · Feb 12 · arXiv

GigaBrain-0.5M* is a robot brain that sees, reads, and acts, and it gets smarter by imagining the future before moving.

#Vision-Language-Action #World Model #Reinforcement Learning

Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

Intermediate

Yunze Tong, Mushui Liu et al. · Feb 6 · arXiv

Text-to-image models trained with GRPO used to assign the same final reward to every sampling step, which is like giving the whole team the same grade no matter who did what.

#TurningPoint-GRPO #GRPO #Flow Matching

Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers

Intermediate

Bozhou Li, Yushuo Guan et al. · Feb 3 · arXiv

The paper shows that using information from many layers of a language model (not just one) helps text-to-image diffusion transformers follow prompts much better.

#Diffusion Transformer #Text Conditioning #Multi-layer LLM Features

Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

Intermediate

I. Apanasevich, M. Artemyev et al. · Jan 31 · arXiv

Green-VLA is a step-by-step training recipe that teaches one model to see, understand language, and move many kinds of robots safely and efficiently.

#Vision-Language-Action #Unified Action Space #Multi-embodiment Pretraining

A Pragmatic VLA Foundation Model

Intermediate

Wei Wu, Fan Lu et al. · Jan 26 · arXiv

LingBot-VLA is a robot brain that listens to language, looks at the world, and decides smooth actions to get tasks done.

#Vision-Language-Action #Foundation Model #Flow Matching

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Intermediate

Shengbang Tong, Boyang Zheng et al. · Jan 22 · arXiv

Before this work, most text-to-image models used VAEs (which squish images into small codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.

#Representation Autoencoder #RAE #Variational Autoencoder

Qwen3-TTS Technical Report

Intermediate

Hangrui Hu, Xinfa Zhu et al. · Jan 22 · arXiv

Qwen3-TTS is a family of text-to-speech models that can talk in 10+ languages, clone a new voice from just 3 seconds, and follow detailed style instructions in real time.

#Qwen3-TTS #text-to-speech #voice cloning

ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

Intermediate

Linqing Zhong, Yi Liu et al. · Jan 16 · arXiv

Robots usually think in words and pictures, but their hands need exact motions, so there is a gap between understanding and doing.

#Vision-Language-Action #Action Chain-of-Thought #Explicit Action Reasoner

SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices

Intermediate

Dongting Hu, Aarush Gupta et al. · Jan 13 · arXiv

This paper shows how to make powerful image-generating Transformers run fast on phones without needing the cloud.

#Diffusion Transformer #Sparse Attention #Adaptive Sparse Self-Attention

TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts

Intermediate

Yu Xu, Hongbin Yan et al. · Jan 12 · arXiv

TAG-MoE is a new way to steer Mixture-of-Experts (MoE) models using clear task hints, so the right “mini-experts” handle the right parts of an image job.

#Task-Aware Gating #Mixture-of-Experts #Unified Image Generation

FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing

Intermediate

Xijie Huang, Chengming Xu et al. · Jan 5 · arXiv

This paper makes video editing easier by teaching an AI to spread changes from the first frame across the whole video smoothly and accurately.

#First-Frame Propagation #Video Editing #FFP-300K
