How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (925)


SOP: A Scalable Online Post-Training System for Vision-Language-Action Models

Intermediate
Mingjie Pan, Siyuan Feng et al. · Jan 6 · arXiv

This paper introduces SOP, a system that lets many real robots learn new skills online at the same time while keeping one shared brain (policy).

#Scalable Online Post-Training #Vision-Language-Action Models #Actor–Learner Architecture
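The "many robots, one shared brain" idea in the summary is an actor–learner setup. Here is a minimal, hypothetical Python sketch of that pattern (the `Policy` class, queue hand-off, and thread layout are illustrative assumptions, not SOP's actual system):

```python
import queue
import threading

class Policy:
    """The one shared 'brain' every actor reads from."""
    def __init__(self):
        self.version = 0  # stand-in for model parameters

    def update(self, batch):
        # A real learner would take a gradient step; here we just bump a version.
        self.version += 1

def actor(robot_id, policy, traj_queue, n_steps):
    # Each robot collects experience under the current shared policy
    # and streams it to the learner.
    for step in range(n_steps):
        traj_queue.put((robot_id, policy.version, step))

def learner(policy, traj_queue, n_updates):
    # A single learner consumes trajectories from ALL robots and
    # updates the shared policy online.
    for _ in range(n_updates):
        batch = traj_queue.get()
        policy.update(batch)

policy = Policy()
q = queue.Queue()
actors = [threading.Thread(target=actor, args=(i, policy, q, 5)) for i in range(3)]
for t in actors:
    t.start()
for t in actors:
    t.join()
learner(policy, q, n_updates=15)
print(policy.version)  # 3 robots x 5 steps = 15 updates
```

The key property this sketches: actors only *produce* data and read the policy, so adding more robots scales data collection without duplicating the learner.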

MMFormalizer: Multimodal Autoformalization in the Wild

Intermediate
Jing Xiong, Qi Han et al. · Jan 6 · arXiv

MMFormalizer is a new system that turns problems with pictures and words (like physics scenes or geometry diagrams) into strict, checkable math statements and proofs.

#multimodal autoformalization #Lean theorem prover #dimensional grounding

Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts

Intermediate
Dhruv Trehan, Paras Chopra · Jan 6 · arXiv

The authors built a simple six-agent system to see if today’s AI models could plan, run, and write a research paper mostly on their own.

#autonomous research pipeline #implementation drift #training data bias

Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners

Beginner
Yihong Liu, Raoyuan Zhao et al. · Jan 6 · arXiv

Large reasoning models can often find the right math answer in their “head” before finishing their written steps, but this works best in languages with lots of training data like English and Chinese.

#latent reasoning #chain-of-thought #multilingual LLMs

Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy

Intermediate
Hosein Hasani, Mohammadali Banayeeanzade et al. · Jan 6 · arXiv

Large language models (LLMs) are good at many math problems but often mess up simple counting when the list gets long.

#mechanistic interpretability #counting in LLMs #System-2 prompting
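The System-2 strategy named in the tags amounts to replacing one-shot counting with explicit per-item steps. A toy Python analogue of that decomposition (the function and the trace format are illustrative assumptions; the paper applies the strategy through prompting, not code):

```python
def system2_count(items, target):
    """Count `target` by tagging every item with a running tally,
    the way a step-by-step (System-2) prompt forces a model to."""
    tally = 0
    trace = []
    for item in items:
        if item == target:
            tally += 1
        trace.append((item, tally))  # explicit intermediate state per item
    # The final answer is just the last running tally.
    return tally, trace

count, trace = system2_count(list("abracadabra"), "a")
print(count)  # 5
```

The point of the decomposition: each step touches one item, so accuracy no longer degrades as the list grows the way a single holistic "how many?" judgment does.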

DreamStyle: A Unified Framework for Video Stylization

Intermediate
Mengtian Li, Jinshu Chen et al. · Jan 6 · arXiv

DreamStyle is a single video-stylizing model that can follow text, copy a style image, or continue from a stylized first frame—without switching tools.

#video stylization #image-to-video (I2V) #token-specific LoRA

MiMo-V2-Flash Technical Report

Intermediate
Xiaomi LLM-Core Team et al. · Jan 6 · arXiv

MiMo-V2-Flash is a giant but efficient language model that uses a team-of-experts design to think well while staying fast.

#Mixture-of-Experts #Sliding Window Attention #Global Attention

AnyDepth: Depth Estimation Made Easy

Intermediate
Zeyu Ren, Zeyu Zhang et al. · Jan 6 · arXiv

AnyDepth is a new, simple way for a computer to tell how far things are in a picture using just one image (monocular depth).

#monocular depth estimation #zero-shot depth #Simple Depth Transformer

SimpleMem: Efficient Lifelong Memory for LLM Agents

Intermediate
Jiaqi Liu, Yaofeng Su et al. · Jan 5 · arXiv

SimpleMem is a new memory system that helps AI remember long conversations without wasting space or tokens.

#LLM memory #semantic compression #online synthesis
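"Remembering long conversations without wasting tokens" usually means compressing old turns instead of keeping them verbatim. A toy sketch of that idea (the `Memory` class, the `window` parameter, and the `summarize` stand-in are all illustrative assumptions, not SimpleMem's actual interface; a real system would summarize with an LLM call):

```python
def summarize(turn):
    # Stand-in for semantic compression: keep only the first clause.
    return turn.split(".")[0]

class Memory:
    def __init__(self, window=2):
        self.window = window      # recent turns kept verbatim
        self.recent = []
        self.compressed = []      # older turns stored as short summaries

    def add(self, turn):
        self.recent.append(turn)
        if len(self.recent) > self.window:
            # Compress the oldest turn instead of dropping it,
            # so long-ago facts stay retrievable at low token cost.
            self.compressed.append(summarize(self.recent.pop(0)))

    def context(self):
        return self.compressed + self.recent

mem = Memory(window=2)
for t in ["Hi. I am vegetarian", "Book a flight. To Paris please", "What do I eat?"]:
    mem.add(t)
print(mem.context())
```

The trade-off this illustrates: the context stays bounded in size, while the compressed entries preserve the gist ("Hi") rather than the full transcript.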

VINO: A Unified Visual Generator with Interleaved OmniModal Context

Beginner
Junyi Chen, Tong He et al. · Jan 5 · arXiv

VINO is a single AI model that can make and edit both images and videos by listening to text and looking at reference pictures and clips at the same time.

#VINO #unified visual generator #multimodal diffusion transformer

Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Intermediate
Jing Tan, Zhaoyang Zhang et al. · Jan 5 · arXiv

Talk2Move is a training recipe that lets an image editor move, rotate, and resize the exact object you mention using plain text, while keeping the rest of the picture stable.

#text-guided image editing #object-level transformation #reinforcement learning

Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling

Beginner
Falcon LLM Team, Iheb Chaabane et al. · Jan 5 · arXiv

Falcon-H1R is a small (7B) AI model that thinks really well without needing giant computers.

#Falcon-H1R #Hybrid Transformer-Mamba #Chain-of-Thought