Reinforcement learning (RL) trains language models by letting them try answers and learn from rewards, but training is slow if we pick the wrong practice questions.
The paper studies why large language models (LLMs) sound too sure of themselves when using retrieval-augmented generation (RAG) and how to fix it.