LOCA-bench is a benchmark that tests whether AI agents keep working correctly as their to-do lists and background information grow very long.
The paper shows that many AI systems work best when a small 'compressor' model first shrinks long text into a short, info-packed summary and a bigger 'predictor' model then reasons over that summary.
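
The compressor-then-predictor arrangement can be pictured with a toy sketch. This is only an illustration under loud assumptions: the paper's components are language models, whereas the `compress` and `predict` functions here are hypothetical word-counting stand-ins, and the names `compress`, `predict`, and `answer` do not come from the paper.

```python
from typing import Callable


def _sentences(text: str) -> list[str]:
    """Split text into non-empty sentences on '.' (a crude splitter)."""
    return [s.strip() for s in text.split(".") if s.strip()]


def compress(text: str, max_sentences: int = 2) -> str:
    """Toy 'compressor' (assumption: stands in for a small model).

    Keeps the sentences with the most distinct words, as a rough proxy
    for information density, and preserves their original order.
    """
    sents = _sentences(text)
    # Rank sentence indices by number of distinct lowercase words.
    ranked = sorted(
        range(len(sents)),
        key=lambda i: len(set(sents[i].lower().split())),
        reverse=True,
    )
    keep = sorted(ranked[:max_sentences])
    return ". ".join(sents[i] for i in keep) + "."


def predict(summary: str, question: str) -> str:
    """Toy 'predictor' (assumption: stands in for a larger model).

    Returns the summary sentence that shares the most words with the question.
    """
    q_words = set(question.lower().strip("?.! ").split())
    return max(
        _sentences(summary),
        key=lambda s: len(q_words & set(s.lower().split())),
        default="",
    )


def answer(
    text: str,
    question: str,
    compressor: Callable[[str], str] = compress,
    predictor: Callable[[str, str], str] = predict,
) -> str:
    """Two-stage pipeline: shrink the long text first, then reason over the summary."""
    return predictor(compressor(text), question)


if __name__ == "__main__":
    long_text = (
        "The cat sat. "
        "The server at port 8080 crashed on Monday after a memory leak. "
        "Ok. "
        "The fix was deployed on Tuesday by the platform team."
    )
    print(answer(long_text, "When was the fix deployed?"))
```

The point of the sketch is the shape of the pipeline: the predictor never sees the full text, only what the compressor chose to keep, so the quality of the final answer depends on how much relevant information survives compression.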