Papers (1262)

PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

This paper presents PaddleOCR-VL-1.5, an upgraded, compact (0.9B) vision-language model that reads and understands messy, real-world documents better than earlier models.

#document parsing #vision-language model #layout analysis

Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

Intermediate

Shuangshuang Ying, Zheyu Wang et al. · Jan 29 · arXiv

This paper builds a safe science “playground” called DeR that fairly tests how AI finds facts (retrieval) and how it thinks with those facts (reasoning) without mixing them up.

#retrieval-augmented generation #document-grounded reasoning #deep research benchmark

MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Intermediate

Honglin Lin, Zheng Liu et al. · Jan 29 · arXiv

MMFineReason is a huge, open dataset (1.8 million examples, 5.1 billion solution tokens) that teaches AIs to think step by step about pictures and text together.

#multimodal reasoning #vision-language models #chain-of-thought

Language-based Trial and Error Falls Behind in the Era of Experience

Intermediate

Haoyu Wang, Guozheng Ma et al. · Jan 29 · arXiv

Big language models are great at words but waste lots of time and energy when they try random actions in non-language games like Sudoku, Sokoban, 2048, FrozenLake, and Rubik’s Cube.

#SCOUT #Reinforcement Learning #Supervised Fine-Tuning

DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning

Intermediate

Mingshuang Luo, Shuang Liang et al. · Jan 29 · arXiv

DreamActor-M2 is a new way to make a still picture move by copying motion from a video while keeping the character’s look the same.

#character image animation #spatiotemporal in-context learning #video diffusion

OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Intermediate

Yufeng Zhong, Lei Chen et al. · Jan 29 · arXiv

OCRVerse is a new AI model that can read both plain text in documents and the visual structures in charts, webpages, and science plots, all in one system.

#Holistic OCR #Vision-Language Model #Supervised Fine-Tuning

Beyond Imitation: Reinforcement Learning for Active Latent Planning

Intermediate

Zhi Zheng, Wee Sun Lee · Jan 29 · arXiv

The paper shows how to make AI think faster and smarter by planning in a hidden space instead of writing long step-by-step sentences.

#latent reasoning #chain-of-thought #variational autoencoder

Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening

Intermediate

Xiaotong Ji, Rasul Tutunov et al. · Jan 29 · arXiv

The paper shows a fast, training-free way to boost an LLM’s step-by-step reasoning by smartly reusing the model’s own probabilities.

#power distribution sampling #distribution sharpening #low-temperature sampling

KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices

Intermediate

Wuyang Zhou, Yuxuan Gu et al. · Jan 29 · arXiv

Hyper-Connections (HC) make the usual single shortcut in neural networks wider by creating several parallel streams and letting the model mix them, but this can become unstable when stacked deep.

#Hyper-Connections #Manifold-Constrained Hyper-Connections #Doubly Stochastic Matrix

Shaping capabilities with token-level data filtering

Intermediate

Neil Rathi, Alec Radford · Jan 29 · arXiv

The paper shows a simple way to teach AI models what not to learn by removing only the exact words (tokens) related to unwanted topics during pretraining.

#token-level data filtering #capability shaping #sparse autoencoders

ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

Intermediate

Xiaoyu Tian, Haotian Wang et al. · Jan 29 · arXiv

ASTRA is a fully automated way to train tool-using AI agents by making both their practice stories (trajectories) and their practice worlds (environments) without humans in the loop.

#tool-augmented agents #multi-turn decision making #verifiable environments

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

Intermediate

Yaorui Shi, Shugui Liu et al. · Jan 29 · arXiv

MemOCR is a new way for AI to remember long histories by turning important notes into a picture with big, bold parts for key facts and tiny parts for details.

#MemOCR #visual memory #adaptive information density
