Papers (9)

#mixture-of-experts

Beyond Language Modeling: An Exploration of Multimodal Pretraining

Shengbang Tong, David Fan et al. · Mar 3 · arXiv

The paper trains one model from scratch to both read text and see images/videos, instead of starting from a language-only model.

#multimodal pretraining #representation autoencoder #RAE

FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

Intermediate

Han Zhao, Jingbo Wang et al. · Feb 19 · arXiv

Robots learn better when they predict short, meaningful summaries of future images instead of drawing every pixel of the future scene.

#world modeling #vision-language-action (VLA) #diffusion policy

ERNIE 5.0 Technical Report

Intermediate

Haifeng Wang, Hua Wu et al. · Feb 4 · arXiv

ERNIE 5.0 is a single giant model that can read and create text, images, video, and audio by predicting the next pieces step by step, like writing a story one line at a time.

#ERNIE 5.0 #unified autoregressive model #mixture-of-experts

Advancing Open-source World Models

Intermediate

Robbyant Team, Zelin Gao et al. · Jan 28 · arXiv

LingBot-World is an open-source world model that turns video generation into an interactive, real-time simulator.

#world model #video diffusion #causal attention

JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

Intermediate

Jiangshan Duo, Hanyu Li et al. · Jan 13 · arXiv

JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.

#RLVR #judge-then-generate #discriminative supervision

SWE-RM: Execution-free Feedback For Software Engineering Agents

Intermediate

KaShun Shum, Binyuan Hui et al. · Dec 26 · arXiv

Coding agents that fix software rely on feedback, but unit tests give only pass/fail signals that are often noisy or missing.

#execution-free feedback #reward model #software engineering agents

Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers

Intermediate

Zeyuan Allen-Zhu · Dec 19 · arXiv

The paper introduces Canon layers, tiny add-ons that let nearby words share information directly, like passing notes along a row of desks.

#Canon layers #horizontal information flow #transformer architecture

LLaDA2.0: Scaling Up Diffusion Language Models to 100B

Intermediate

Tiwei Bie, Maosong Cao et al. · Dec 10 · arXiv

Before this work, most big language models talked one word at a time (autoregressive), which made them slow and hard to parallelize.

#diffusion language model #masked diffusion #block diffusion

ProPhy: Progressive Physical Alignment for Dynamic World Simulation

Intermediate

Zijun Wang, Panwen Hu et al. · Dec 5 · arXiv

ProPhy is a new two-step method that helps video AIs follow real-world physics, not just make pretty pictures.

#physics-aware video generation #mixture-of-experts #token-level routing