TreeGRPO teaches image generators using a smart branching tree so each training run produces many useful learning signals instead of just one.
The paper shows that many AI image generators are trained to prefer one popular idea of beauty, even when a user clearly asks for something messy, dark, blurry, or emotionally heavy.
Most image-similarity tools only notice how things look (color, shape, class) and miss deeper, human-like connections.
UnityVideo is a single, unified model that learns from many kinds of video information at once—like colors (RGB), depth, motion (optical flow), body pose, skeletons, and segmentation—to make smarter, more realistic videos.
This paper shows that we can turn big, smart vision features into a small, easy-to-use code for image generation with just one attention layer.
GRAPE is a new way to tell Transformers where each word is in a sentence by using neat math moves called group actions.
OneStory is a new way to make long, multi-shot videos that keep the story, characters, and places consistent across time.
The paper asks when reinforcement learning (RL) really makes language models better at reasoning beyond what they learned in pre-training.
This paper shows a new way to teach an autoencoder to shape its hidden space (the 'latent space') to look like any distribution we want, not just a simple bell curve.
DeepCode is an AI coding system that turns long, complicated papers into full, working code repositories.
LongCat-Image is a small (6B) but mighty bilingual image generator that turns text into high-quality, realistic pictures and can also edit images very well.
Big language models use RoPE to remember word order, but standard attention keeps only the real half of RoPE's complex numbers and throws the imaginary half away.
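The "thrown-away imaginary half" is a standard property of RoPE, easy to check directly: rotating a query/key feature pair is complex multiplication by e^{i·pos·θ}, and the usual real-valued dot-product score equals only the real part of the complex product q·conj(k). A minimal NumPy sketch (all names and values here are illustrative, not from the paper):

```python
import numpy as np

theta = 0.3          # rotation frequency for this feature pair (illustrative)
m, n = 5, 2          # positions of the query and key tokens

q = np.array([0.7, -0.2])   # query feature pair (q1, q2)
k = np.array([0.4, 0.9])    # key feature pair (k1, k2)

def rope_rotate(x, pos, theta):
    """Standard RoPE 2-D rotation of a feature pair by angle pos*theta."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

# Real-valued attention score after RoPE: a plain dot product.
score_real = rope_rotate(q, m, theta) @ rope_rotate(k, n, theta)

# Complex view: the same pairs as complex numbers rotated by e^{i*pos*theta}.
qc = (q[0] + 1j * q[1]) * np.exp(1j * m * theta)
kc = (k[0] + 1j * k[1]) * np.exp(1j * n * theta)
score_complex = qc * np.conj(kc)

# The usual score is exactly the real part; the imaginary part is discarded.
assert np.isclose(score_real, score_complex.real)
```

The imaginary part of `score_complex` carries extra relative-position information that the ordinary dot product never sees, which is the gap the paper targets.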