This paper says long chain-of-thought (Long CoT) works best when it follows a 'molecular' pattern with three kinds of thinking bonds: Deep-Reasoning, Self-Reflection, and Self-Exploration.
This paper introduces LAMER, a Meta-RL training framework that teaches language agents to explore first and then use what they learned to solve tasks faster.