This paper shows that many AI models that can both understand and create images are not truly unified inside: they often understand well but fail to generate, or the other way around.
World models are AI tools that imagine the future so a robot can plan what to do next, but they are expensive to run many times in a row.
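To make that cost concrete: planning with a world model usually means imagining many candidate futures and keeping the best one, so the model gets called horizon × candidates times for every single decision. Here is a toy random-shooting sketch; the dynamics and reward functions are stand-ins for illustration, not any particular paper's model.

```python
import numpy as np

def rollout_score(world_model, reward_fn, state, actions):
    """Roll one candidate action sequence through the world model, summing predicted rewards."""
    total = 0.0
    for a in actions:
        state = world_model(state, a)      # imagined next state
        total += reward_fn(state)          # score the imagined outcome
    return total

def plan(world_model, reward_fn, state, horizon=10, n_candidates=64, seed=0):
    """Random-shooting planner: sample action sequences, return the first action of the best one.
    Each call runs the world model horizon * n_candidates times, which is why
    re-planning at every step gets expensive."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, 2))  # toy 2-D actions
    scores = [rollout_score(world_model, reward_fn, state, seq) for seq in candidates]
    return candidates[int(np.argmax(scores))][0]

# Toy stand-ins: simple additive dynamics, reward for moving toward the origin.
toy_model = lambda s, a: s + 0.1 * a
toy_reward = lambda s: -float(np.linalg.norm(s))
print(plan(toy_model, toy_reward, np.array([1.0, -1.0])))
```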
Large language models don’t map out a full step-by-step plan before they start thinking; they mostly plan just a little bit ahead.
FSVideo is a new image-to-video generator that runs about 42× faster than popular open-source models while keeping similar visual quality.
The paper introduces RPG-Encoder, a way to turn a whole code repository into one clear map that mixes meaning (semantics) with structure (dependencies).
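The summary doesn't spell out how RPG-Encoder builds its map, but the general recipe it gestures at, attaching a semantic embedding to each file and wiring files together by their import dependencies, can be sketched in a few lines. The `embed` function below is a hypothetical text encoder, not part of the paper.

```python
import ast
import networkx as nx

def build_repo_graph(files, embed):
    """One graph per repository: nodes carry a semantic embedding of each file's
    source (meaning), edges follow import dependencies (structure)."""
    graph = nx.DiGraph()
    for path, source in files.items():
        graph.add_node(path, embedding=embed(source))
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    graph.add_edge(path, alias.name)          # edge to the imported module name
            elif isinstance(node, ast.ImportFrom) and node.module:
                graph.add_edge(path, node.module)
    return graph

files = {"app.py": "import utils\nutils.run()", "utils.py": "def run():\n    pass"}
g = build_repo_graph(files, embed=lambda text: [float(len(text))])  # toy embedding
print(list(g.edges()))
```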
Long tasks trip up most AIs because they lose track of goals and make small mistakes that snowball over many steps.
The paper asks AI to hunt for insights in big databases without being told exact questions, like a curious scientist instead of a test-taker.
Shampoo is a smart optimizer that can train models better than AdamW, but it used to be slow because it must compute tricky inverse matrix roots.
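The bottleneck mentioned there is concrete: Shampoo preconditions gradients with inverse roots of large statistics matrices, an inverse fourth root in the common matrix case. Below is a minimal NumPy sketch of that operation via eigendecomposition, the straightforward approach rather than the faster methods newer implementations use.

```python
import numpy as np

def inverse_pth_root(mat, p=4, eps=1e-6):
    """Compute mat^(-1/p) for a symmetric positive-definite matrix via
    eigendecomposition. Shampoo applies roots like this to its gradient-statistics
    matrices; recomputing them naively at every step is what made it slow."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.maximum(eigvals, eps)                # guard against tiny/negative eigenvalues
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T

# Toy preconditioner: accumulated G G^T statistics from a random "gradient".
g = np.random.default_rng(0).standard_normal((4, 16))
stats = g @ g.T
root = inverse_pth_root(stats)
# root is stats^(-1/4), so root^4 should approximately invert stats.
print(np.allclose(np.linalg.matrix_power(root, 4) @ stats, np.eye(4), atol=1e-5))
```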
Large Vision-Language Models (LVLMs) are great with one picture but get confused when you give them several, often mixing details from different images.
VIBE is a new test that checks how well image-editing AI models follow visual instructions like arrows, boxes, and sketches—not just text.
The paper makes long video generation much faster and lighter on memory by cutting out repeated work in attention.
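The summary doesn't say which repeated work gets cut, so treat the following only as an illustration of the general pattern: when generating long videos frame by frame, keys and values for frames that were already processed can be cached and reused instead of being recomputed for every new frame. A toy PyTorch sketch:

```python
import torch
import torch.nn.functional as F

class CachedFrameAttention(torch.nn.Module):
    """Toy illustration of skipping repeated attention work across video frames:
    keys/values for already-processed frames are cached and reused, so each new
    frame only projects its own tokens instead of re-encoding the whole history."""
    def __init__(self, dim=64):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.v = torch.nn.Linear(dim, dim)
        self.k_cache, self.v_cache = [], []

    def forward(self, frame_tokens):                      # (tokens, dim) for one new frame
        self.k_cache.append(self.k(frame_tokens))         # computed once per frame, then reused
        self.v_cache.append(self.v(frame_tokens))
        keys = torch.cat(self.k_cache)                    # all past frames plus the current one
        values = torch.cat(self.v_cache)
        attn = F.softmax(self.q(frame_tokens) @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
        return attn @ values

attn = CachedFrameAttention()
for _ in range(3):                                        # three frames; old K/V never recomputed
    out = attn(torch.randn(16, 64))
print(out.shape)                                          # torch.Size([16, 64])
```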
The paper tests a simple but bold idea: show code to AI as pictures instead of plain text, then shrink those pictures to save tokens and time.
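The paper's exact rendering and compression pipeline isn't given in the summary; below is a minimal sketch of the idea using Pillow: draw the source text onto a canvas, then downscale it so a vision model sees a much smaller input. The function name and scale factor are illustrative, not from the paper.

```python
from PIL import Image, ImageDraw

def code_to_image(source, scale=0.5, font_px=10):
    """Render source code onto a white canvas, then downscale it.
    A vision-language model would consume the small image instead of the raw
    text tokens; `scale` controls how aggressively the picture is shrunk."""
    lines = source.splitlines() or [""]
    width = int(max(len(l) for l in lines) * font_px * 0.6) + 20
    height = len(lines) * (font_px + 4) + 20
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, 10 + i * (font_px + 4)), line, fill="black")
    return img.resize((int(width * scale), int(height * scale)), Image.LANCZOS)

snippet = "def add(a, b):\n    return a + b"
small = code_to_image(snippet)
print(small.size)   # roughly half the rendered resolution in each dimension
```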