OCRVerse is a new AI model that can read both plain text in documents and the visual structures in charts, webpages, and science plots, all in one system.
This paper explains how to turn large language models (LLMs) from quiet students that only answer questions into active agents that can plan, act, and learn over time.
Big language models can learn new facts from example-based tutoring (supervised fine-tuning, or SFT), but that doesn't automatically teach them how to use those facts well.
MatchTIR teaches AI agents to judge each tool call step-by-step instead of giving the same reward to every step.
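A minimal sketch of the idea behind step-level rewards (the function names and matching rule here are illustrative, not MatchTIR's actual algorithm): instead of one pass/fail score for a whole run, each tool call is scored individually, for example by matching it against a reference trajectory.

```python
# Illustrative sketch: per-step credit assignment for tool-calling
# trajectories vs. a single trajectory-level reward.

def trajectory_reward(steps, final_correct):
    """Uniform scheme: every step inherits the final outcome."""
    r = 1.0 if final_correct else 0.0
    return [r] * len(steps)

def step_level_reward(steps, reference_steps):
    """Matching scheme (hypothetical rule): each tool call is scored
    against a reference trajectory, so good steps in a failed run
    still earn credit."""
    ref = set(reference_steps)
    return [1.0 if s in ref else 0.0 for s in steps]

steps = ["search(weather)", "calc(2+2)", "search(capital of France)"]
reference = ["search(capital of France)", "calc(2+2)"]

print(trajectory_reward(steps, final_correct=False))  # [0.0, 0.0, 0.0]
print(step_level_reward(steps, reference))            # [0.0, 1.0, 1.0]
```

Under the uniform scheme the two useful tool calls get no credit because the run failed; the matching scheme rewards them anyway, which is the intuition behind judging each step on its own.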
SmartSearch teaches search agents to fix their own bad search queries while they are thinking, not just their final answers.
This paper teaches a model to turn a question about a table into both a short answer and a clear, correct chart.
MiMo-V2-Flash is a giant but efficient language model that uses a mixture-of-experts design (a team of specialist sub-networks, only a few of which run per token) to think well while staying fast.
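A toy sketch of how a mixture-of-experts layer stays fast (random weights and a generic top-k gate; this is the standard MoE pattern, not MiMo-V2-Flash's actual architecture): a gate scores all experts, but only the top few actually run.

```python
import numpy as np

def moe_forward(x, experts, gate_w, top_k=2):
    """Route the input to the top-k experts by gate score and mix
    their outputs, so only a fraction of parameters run per token."""
    scores = x @ gate_w                   # one score per expert
    top = np.argsort(scores)[-top_k:]     # indices of the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()              # softmax over the chosen k
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 4, 8
# Each "expert" is just a random linear map in this sketch.
experts = [lambda x, W=rng.standard_normal((d, d)): x @ W
           for _ in range(n_experts)]
gate_w = rng.standard_normal((d, n_experts))
x = rng.standard_normal(d)
y = moe_forward(x, experts, gate_w)  # only 2 of the 8 experts ran
```

With 8 experts and top-2 routing, roughly a quarter of the expert parameters are touched per token, which is why such models can be huge yet cheap to run.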
Visual Autoregressive (VAR) models draw whole grids of image tokens at once across multiple scales, which makes standard reinforcement learning (RL) unstable.
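A small sketch of why this breaks per-token RL (the scale schedule and random sampling below are placeholders, not the paper's model): in next-scale prediction, one generation "step" emits an entire grid of tokens, not a single token.

```python
import numpy as np

rng = np.random.default_rng(0)
scales = [1, 2, 4, 8]   # side lengths of the token grids, coarse to fine
vocab = 16

grids = []
for s in scales:
    # A real VAR transformer predicts this whole grid at once,
    # conditioned on all coarser grids; here we sample at random.
    grids.append(rng.integers(0, vocab, size=(s, s)))

# Standard RL treats one action = one token, but here one "action"
# is an s*s grid, so token-level credit assignment no longer fits.
total_tokens = sum(g.size for g in grids)
print(total_tokens)  # 85 tokens emitted in only 4 steps
```

Four steps produce 85 tokens, so a reward signal attached to "the action" has to be spread over a whole grid at once, which is the source of the instability the summary mentions.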
NextFlow is a single, decoder-only Transformer that can read and write both text and images in one continuous sequence.
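A sketch of what "one continuous sequence" can look like (the boundary tokens, id ranges, and vocabulary here are invented for illustration, not NextFlow's actual tokenization): text token ids and image token ids share a single stream, separated by special markers, so one decoder handles both.

```python
# Hypothetical token layout for a unified text+image sequence.
TEXT = {"a": 0, "cat": 1}
BOI, EOI = 1000, 1001   # made-up begin/end-of-image boundary tokens
IMG_OFFSET = 2000       # image codebook ids live in their own id range

def interleave(text_ids, image_ids):
    """Pack text tokens and image tokens into one flat sequence."""
    return text_ids + [BOI] + [IMG_OFFSET + i for i in image_ids] + [EOI]

seq = interleave([TEXT["a"], TEXT["cat"]], [3, 7, 5])
print(seq)  # [0, 1, 1000, 2003, 2007, 2005, 1001]
```

Because everything is just one id sequence, the same next-token decoder can read an image mid-sentence or start writing one, which is the point of the unified design.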
MDAgent2 is a special helper built from large language models (LLMs) that can both answer questions about molecular dynamics and write runnable LAMMPS simulation code.
K-EXAONE is a super-sized language model that speaks six languages and can read very long documents (up to 256,000 tokens) without forgetting important details.
CPPO is a new way to fine-tune vision-language models so they see pictures more accurately before they start to reason.