Papers (9)

#AIME

CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

CHIMERA is a small (about 9,000 examples) but very carefully built synthetic dataset that teaches AI to solve hard problems step by step.

#CHIMERA dataset #synthetic data generation #chain-of-thought

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Intermediate

Xin Xu, Clive Bai et al. · Feb 12 · arXiv

This paper shows a simple way to turn many 'too-easy' questions into harder, still-checkable ones so that AI keeps learning instead of stalling.

#Reinforcement Learning with Verifiable Rewards #Compositional prompts #Sequential Prompt Composition

Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

Intermediate

Dawid J. Kopiczko, Sagar Vaze et al. · Feb 11 · arXiv

The paper shows that, when teaching a reasoning AI with step-by-step examples, repeating a small set many times can beat using a huge set only once.

#Supervised Fine-Tuning #Chain-of-Thought #Data Repetition

Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability

Intermediate

Xiao Liang, Zhong-Zhi Li et al. · Feb 2 · arXiv

The paper trains language models to solve hard problems by first breaking them into smaller parts and then solving those parts, instead of only thinking in one long chain.

#divide-and-conquer reasoning #chain-of-thought #reinforcement learning

Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

Intermediate

Yanqi Dai, Yuxiang Ji et al. · Jan 28 · arXiv

This paper argues that, to make math-solving AIs smarter, we should train them more on the hardest questions they can almost solve.

#Mathematical reasoning #RLVR #GRPO

MiMo-V2-Flash Technical Report

Intermediate

Xiaomi LLM-Core Team et al. · Jan 6 · arXiv

MiMo-V2-Flash is a giant but efficient language model that uses a team-of-experts design to think well while staying fast.

#Mixture-of-Experts #Sliding Window Attention #Global Attention

Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling

Beginner

Falcon LLM Team, Iheb Chaabane et al. · Jan 5 · arXiv

Falcon-H1R is a small (7B) AI model that thinks really well without needing giant computers.

#Falcon-H1R #Hybrid Transformer-Mamba #Chain-of-Thought

Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Intermediate

Peter Chen, Xiaopeng Li et al. · Dec 18 · arXiv

The paper studies why two opposite-sounding tricks in RL for reasoning, adding random (spurious) rewards and reducing randomness (entropy), can both seem to help large language models think better.

#RLVR #Group Relative Policy Optimization #ratio clipping

Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory

Beginner

Mirac Suzgun, Mert Yuksekgonul et al. · Apr 10 · arXiv

The paper introduces Dynamic Cheatsheet (DC), a simple way for language models to keep a tiny, smart notebook of useful tricks while they are being used.

#Dynamic Cheatsheet #test-time learning #memory curation
