How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (6)

#long-context modeling

Reinforced Fast Weights with Next-Sequence Prediction

Hee Seung Hwang, Xindi Wu et al. | Feb 18 | arXiv

Fast weight models remember context with a tiny, fixed memory, but standard next-token training teaches them to think only one word ahead.

#fast weight models #next-sequence prediction #reinforcement learning for LMs


MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

MiniCPM Team, Wenhao An et al. | Feb 12 | arXiv

MiniCPM-SALA is a 9B-parameter language model that mixes two kinds of attention—sparse and linear—to read very long texts quickly and accurately.

#long-context modeling #sparse attention #linear attention


Context Forcing: Consistent Autoregressive Video Generation with Long Context

Shuo Chen, Cong Wei et al. | Feb 5 | arXiv

The paper fixes a big problem in long video generation: models either forget what happened or slowly drift off-topic over time.

#autoregressive video generation #long-context modeling #distribution matching distillation


Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories

Sidi Lu, Zhenwen Liang et al. | Feb 4 | arXiv

Locas is a new kind of add-on memory for language models that learns during use but touches none of the model’s original weights.

#Locas #parametric memory #test-time training


SWE-RM: Execution-free Feedback For Software Engineering Agents

KaShun Shum, Binyuan Hui et al. | Dec 26 | arXiv

Coding agents that fix software rely on feedback, but unit tests give only pass/fail signals that are often noisy or missing.

#execution-free feedback #reward model #software engineering agents


Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

Jingdi Lei, Di Zhang et al. | Dec 14 | arXiv

Standard attention is slow for long texts because it compares every word with every other word, which takes time quadratic in the text length.

#error-free linear attention #rank-1 matrix exponential #continuous-time dynamics
