Papers (30)



#reinforcement learning

Reasoning Models Struggle to Control their Chains of Thought

Beginner

Chen Yueh-Han, Robert McCarthy et al. | Mar 5 | arXiv

The paper studies whether AI models can hide or reshape their step-by-step thoughts (chains of thought) on command.

#chain-of-thought #controllability #monitorability


KARL: Knowledge Agents via Reinforcement Learning

Beginner

Jonathan D. Chang, Andrew Drozdov et al. | Mar 5 | arXiv

KARL is a smart search helper that learns to look up information step by step and explain answers using the facts it finds.

#grounded reasoning #enterprise search #reinforcement learning


Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Beginner

Zhenting Wang, Huancheng Chen et al. | Mar 4 | arXiv

This paper teaches long-horizon AI agents to recall past experience exactly without loading their whole memory at once.

#indexed memory #LLM agents #long-horizon tasks


MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

Beginner

Jiachun Li, Shaoping Huang et al. | Mar 2 | arXiv

MMR-Life is a new benchmark that checks how well AI understands everyday situations using several real photos at once.

#multimodal reasoning #multi-image understanding #real-life benchmark


CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

Beginner

Xinyu Zhu, Yihao Feng et al. | Mar 1 | arXiv

CHIMERA is a small (about 9,000 examples) but carefully built synthetic dataset that teaches AI to solve hard problems step by step.

#CHIMERA dataset #synthetic data generation #chain-of-thought


Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

Beginner

Chris Samarinas, Haw-Shiuan Chang et al. | Feb 26 | arXiv

SLATE is a new way to teach AI to think step by step while using a search engine, giving feedback at each step instead of only at the end.

#retrieval-augmented reasoning #reinforcement learning #GRPO


WorldCompass: Reinforcement Learning for Long-Horizon World Models

Beginner

Zehan Wang, Tengfei Wang et al. | Feb 9 | arXiv

WorldCompass uses reinforcement learning after pretraining to teach video world models to follow actions more accurately while preserving visual quality.

#world models #reinforcement learning #clip-level rollout


LLaDA2.1: Speeding Up Text Diffusion via Token Editing

Beginner

Tiwei Bie, Maosong Cao et al. | Feb 9 | arXiv

LLaDA2.1 teaches a diffusion-style language model to write fast rough drafts and then fix its own mistakes by editing tokens it already wrote.

#discrete diffusion language model #editable decoding #token-to-token editing


RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Beginner

Yinjie Wang, Tianbao Xie et al. | Feb 2 | arXiv

RLAnything is a new reinforcement learning (RL) framework that trains three things at once: the policy (the agent), the reward model (the judge), and the environment (the tasks).

#reinforcement learning #closed-loop optimization #reward modeling


Kimi K2.5: Visual Agentic Intelligence

Beginner

Kimi Team, Tongtong Bai et al. | Feb 2 | arXiv

Kimi K2.5 is a new open-source AI model that can read both text and visuals (images and videos) and act like a team of helpers to finish big tasks faster.

#multimodal learning #vision-language models #joint optimization


Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report

Beginner

Zhuoran Yang, Ed Li et al. | Jan 28 | arXiv

This paper introduces Foundation-Sec-8B-Reasoning, a small (8-billion-parameter) AI model trained to “think out loud” before answering cybersecurity questions.

#native reasoning #cybersecurity LLM #chain-of-thought


Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents

Beginner

Zhihan Liu, Lin Guan et al. | Jan 26 | arXiv

LLM agents are usually trained in a few environments but asked to work in many different, unseen ones, which often hurts their performance.

#cross-domain generalization #state information richness #planning complexity

