NanoKnow is a new benchmark that checks whether a language model's answers come from what it saw during training or from extra text we give it at question time.
KAGE-Bench is a fast, carefully controlled benchmark that tests how well reinforcement learning (RL) agents trained on pixels handle specific visual changes, like new backgrounds or lighting, without changing the actual game rules.