Papers (1055)

Latent Adversarial Regularization for Offline Preference Optimization

Enyi Jiang, Yibo Jacky Zhang et al. · Jan 29 · arXiv

This paper introduces GANPO, a new way to train language models from human preferences by guiding the model using its hidden thoughts (latent space) instead of just its visible words (token space).

#GANPO #latent space regularization #offline preference optimization

VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Intermediate

Yibo Wang, Yongcheng Jing et al. · Jan 29 · arXiv

This paper shows a new way to help AI think through long problems faster by turning earlier text steps into small pictures the AI can reread.

#vision-text compression #optical memory #iterative reasoning

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Intermediate

Wenxuan Huang, Yu Zeng et al. · Jan 29 · arXiv

The paper tackles a real problem: one-shot image or text searches often miss the right evidence (low hit-rate), especially in noisy, cluttered pictures.

#multimodal deep research #visual question answering #ReAct reasoning

MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

Intermediate

Baorui Ma, Jiahui Yang et al. · Jan 29 · arXiv

Metric Anything is a new way to teach AI real, ruler-like distances (metric depth) from very mixed and noisy 3D data.

#metric depth estimation #sparse metric prompt #monocular depth

PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction

Intermediate

Changjian Jiang, Kerui Ren et al. · Jan 29 · arXiv

PLANING is a new way to build 3D worlds from a single moving camera by combining two kinds of pieces: sharp triangles for shape and soft Gaussians for appearance.

#Streaming 3D Reconstruction #Triangle Primitives #Neural Gaussians

CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

Intermediate

Johannes Kirmayr, Lukas Stappen et al. · Jan 29 · arXiv

CAR-bench is a new 'driving test' for AI assistants that checks if they can stay careful, honest, and consistent during real back-and-forth conversations in a car.

#LLM agents #benchmarking #consistency

Causal World Modeling for Robot Control

Intermediate

Lin Li, Qihang Zhang et al. · Jan 29 · arXiv

Robots used to copy actions from videos without truly understanding how the world changes, so they often messed up long, multi-step jobs.

#robot world model #autoregressive diffusion #causal masking

Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

Intermediate

Shuangshuang Ying, Zheyu Wang et al. · Jan 29 · arXiv

This paper builds a safe science “playground” called DeR that fairly tests how AI finds facts (retrieval) and how it thinks with those facts (reasoning) without mixing them up.

#retrieval-augmented generation #document-grounded reasoning #deep research benchmark

MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Intermediate

Honglin Lin, Zheng Liu et al. · Jan 29 · arXiv

MMFineReason is a huge, open dataset (1.8 million examples, 5.1 billion solution tokens) that teaches AIs to think step by step about pictures and text together.

#multimodal reasoning #vision-language models #chain-of-thought

Language-based Trial and Error Falls Behind in the Era of Experience

Intermediate

Haoyu Wang, Guozheng Ma et al. · Jan 29 · arXiv

Big language models are great at words but waste lots of time and energy when they try random actions in non-language games like Sudoku, Sokoban, 2048, FrozenLake, and Rubik’s Cube.

#SCOUT #Reinforcement Learning #Supervised Fine-Tuning

DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning

Intermediate

Mingshuang Luo, Shuang Liang et al. · Jan 29 · arXiv

DreamActor-M2 is a new way to make a still picture move by copying motion from a video while keeping the character’s look the same.

#character image animation #spatiotemporal in-context learning #video diffusion

OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Intermediate

Yufeng Zhong, Lei Chen et al. · Jan 29 · arXiv

OCRVerse is a new AI model that can read both plain text in documents and the visual structures in charts, webpages, and science plots, all in one system.

#Holistic OCR #Vision-Language Model #Supervised Fine-Tuning
