Papers (8)

#representation alignment

DreamWorld: Unified World Modeling in Video Generation

Boming Tan, Xiangdong Zhang et al. · Feb 28 · arXiv

DreamWorld is a new way to make videos that not only look real but also follow common-sense rules about motion, space, and meaning.

#video diffusion transformer #world model #optical flow

FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

Intermediate

Han Zhao, Jingbo Wang et al. · Feb 19 · arXiv

Robots learn better when they predict short, meaningful summaries of future images instead of drawing every pixel of the future scene.

#world modeling #vision-language-action (VLA) #diffusion policy

Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better

Intermediate

Ji Zhao, Yufei Gu et al. · Feb 5 · arXiv

Big idea: use a small, already-trained model to help a bigger model learn good habits early, so the big one trains faster and ends up smarter.

#Late-to-Early Training #LLM pretraining acceleration #representation alignment

DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation

Intermediate

Hun Chang, Byunghee Cha et al. · Jan 30 · arXiv

DINO-SAE is a new autoencoder that keeps both the meaning of an image (semantics) and tiny textures (fine details) at the same time.

#DINO-SAE #spherical manifold #cosine similarity alignment

UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

Intermediate

Ruiheng Zhang, Jingfeng Yao et al. · Jan 16 · arXiv

UniX is a new medical AI that both understands chest X-rays (writing accurate reports) and generates chest X-ray images (with high visual quality) without letting the two jobs interfere with each other.

#UniX #autoregressive branch #diffusion branch

Boosting Latent Diffusion Models via Disentangled Representation Alignment

Intermediate

John Page, Xuesong Niu et al. · Jan 9 · arXiv

This paper shows that the best VAEs for image generation are the ones whose latents neatly separate object attributes, a property called semantic disentanglement.

#Send-VAE #semantic disentanglement #latent diffusion

Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

Beginner

Jinghan Li, Yang Jin et al. · Dec 24 · arXiv

This paper introduces NExT-Vid, a way to teach a video model by asking it to guess the next frame of a video while parts of the past are hidden.

#autoregressive video pretraining #masked next-frame prediction #context isolation

What matters for Representation Alignment: Global Information or Spatial Structure?

Intermediate

Jaskirat Singh, Xingjian Leng et al. · Dec 11 · arXiv

This paper asks whether generation training benefits more from an encoder’s big-picture meaning (global semantics) or from how features are arranged across space (spatial structure).

#representation alignment #REPA #iREPA
