Kimi K2.5 is a new open-source AI that can read both text and visuals (images and videos) and act like a team of helpers to finish big tasks faster.
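To make the "team of helpers" idea concrete, here is a minimal Python sketch of the general pattern: a lead agent splits a job and runs helper agents at the same time. All names are hypothetical, and this is not Kimi K2.5's actual code.

```python
import asyncio

# Illustrative only: a generic "team of helpers" pattern with made-up names.
# Nothing here comes from Kimi K2.5's real implementation.

async def helper(name: str, subtask: str) -> str:
    """Stand-in for one agent handling one piece of the job."""
    await asyncio.sleep(0.1)  # pretend this is a model or tool call
    return f"{name} done: {subtask}"

async def lead(task_pieces: list[str]) -> list[str]:
    # The speedup comes from running the helpers concurrently.
    jobs = [helper(f"helper-{i}", piece) for i, piece in enumerate(task_pieces)]
    return await asyncio.gather(*jobs)

if __name__ == "__main__":
    pieces = ["search the web", "read the PDF", "draft the summary"]
    for line in asyncio.run(lead(pieces)):
        print(line)
```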
LatentMorph teaches an image-making AI to quietly think in its head while it draws, instead of stopping to write out its thoughts in words.
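A rough sketch of what "thinking in its head" can look like in code: the model refines a hidden vector for a few silent steps before each drawing step, and no text is ever produced. The module names and sizes below are made up for illustration, not taken from LatentMorph.

```python
import torch
import torch.nn as nn

# Illustrative only: the general "think silently in latent space" pattern,
# not LatentMorph's actual architecture.

class SilentThinker(nn.Module):
    def __init__(self, dim: int = 256, think_steps: int = 4):
        super().__init__()
        self.think_steps = think_steps
        self.think = nn.GRUCell(dim, dim)  # wordless internal reasoning
        self.draw = nn.Linear(dim, dim)    # stand-in for one drawing step

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.zeros_like(x)
        for _ in range(self.think_steps):
            # Refine the hidden "thought"; nothing is decoded into words.
            h = self.think(x, h)
        return self.draw(h)

model = SilentThinker()
canvas = torch.randn(1, 256)   # current state of the image being made
print(model(canvas).shape)     # torch.Size([1, 256])
```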
The paper fixes a hidden mistake many fast video generators were making when turning a "see-everything" model into a "see-past-only" model.
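"See-everything" vs. "see-past-only" is the difference between bidirectional and causal attention. The toy sketch below just shows the two masks side by side; the specific conversion mistake the paper identifies is not reproduced here.

```python
import torch

# Toy illustration: bidirectional attention lets every video frame attend
# to every other frame; causal attention masks out the future.

T = 5  # number of video frames
scores = torch.randn(T, T)  # raw attention scores between frames

bidirectional = scores.softmax(dim=-1)  # every frame sees all frames

# Mask out strictly-future positions, then renormalize.
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
causal = scores.masked_fill(future, float("-inf")).softmax(dim=-1)

print(bidirectional[0])  # frame 0 attends to future frames too
print(causal[0])         # frame 0 can only attend to itself
```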
This paper studies how AI agents get better while they are working, not just whether they finish the job.
The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to answer.
This paper fixes a common problem in reasoning AIs called Lazy Reasoning, where the model rambles instead of making a good plan.
Loop-ViT is a vision model that thinks in loops, so it can take more steps on hard puzzles and stop early on easy ones.
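"Thinking in loops" is the adaptive-computation pattern: reuse one block repeatedly and let a small halting head decide when to stop. The sketch below shows that generic pattern with made-up names and sizes, not Loop-ViT's actual design.

```python
import torch
import torch.nn as nn

# Illustrative only: one shared block applied in a loop, with a halting
# head that lets easy inputs exit early.

class LoopedBlock(nn.Module):
    def __init__(self, dim: int = 128, max_loops: int = 8, threshold: float = 0.9):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.halt = nn.Linear(dim, 1)  # predicts "am I done yet?"
        self.max_loops = max_loops
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, int]:
        steps = 0
        for steps in range(1, self.max_loops + 1):
            x = x + self.block(x)  # one more "thought" over the same weights
            if torch.sigmoid(self.halt(x)).mean() > self.threshold:
                break              # easy input: stop early
        return x, steps

model = LoopedBlock()
tokens = torch.randn(1, 16, 128)  # 16 image patches
out, steps_used = model(tokens)
print(f"stopped after {steps_used} loop(s)")
```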
This paper shows that many AI models that both read images and write images are not truly unified inside: they often understand well but fail to generate, or the other way around.
World models are AI tools that imagine the future so a robot can plan what to do next, but they are expensive to run many times in a row.
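To see where the cost comes from, here is a minimal random-shooting planner over a toy world model: every candidate plan needs a full imagined rollout, so the model is called horizon × candidates times for a single decision. The dynamics function is a made-up stand-in for a learned model.

```python
import numpy as np

def world_model(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    # Learned dynamics would go here; this toy just drifts toward the action.
    return 0.9 * state + 0.1 * action

def plan(state, goal, horizon=10, n_candidates=256, seed=0):
    # Random-shooting planner: imagine many futures, keep the best one.
    rng = np.random.default_rng(seed)
    best_cost, best_first_action = np.inf, None
    for _ in range(n_candidates):
        s = state.copy()
        actions = rng.uniform(-1, 1, size=(horizon, state.shape[0]))
        for a in actions:          # horizon model calls per candidate...
            s = world_model(s, a)  # ...so cost scales as horizon * candidates
        cost = np.linalg.norm(s - goal)
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    return best_first_action

print(plan(np.zeros(3), goal=np.ones(3)))
```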
Large language models don’t map out a full step-by-step plan before they start thinking; they mostly plan just a little bit ahead.
FSVideo is a new image-to-video generator that runs about 42× faster than popular open-source models while keeping similar visual quality.
The paper introduces RPG-Encoder, a way to turn a whole code repository into one clear map that mixes meaning (semantics) with structure (dependencies).
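A crude sketch of the "map" idea, assuming Python source files: record each module's docstring as a stand-in for its meaning, and its imports as its dependencies. RPG-Encoder's actual representation is richer; the function below only illustrates mixing the two signals in one structure.

```python
import ast
import pathlib

def repo_map(root: str) -> dict[str, dict]:
    """Map each .py file to its docstring (meaning) and imports (structure)."""
    graph: dict[str, dict] = {}
    for path in pathlib.Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that don't parse cleanly
        imports = sorted(
            {alias.name for node in ast.walk(tree)
             if isinstance(node, ast.Import) for alias in node.names}
            | {node.module for node in ast.walk(tree)
               if isinstance(node, ast.ImportFrom) and node.module}
        )
        graph[str(path)] = {
            "semantics": ast.get_docstring(tree) or "",  # the "meaning" signal
            "depends_on": imports,                       # the "structure" signal
        }
    return graph

if __name__ == "__main__":
    for module, info in repo_map(".").items():
        print(module, "->", info["depends_on"])
```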