RealMem is a new benchmark that tests how well AI assistants remember and manage long, ongoing projects across many conversations.
MemGovern teaches code agents to learn from past human fixes on GitHub by turning messy discussions into clean, reusable 'experience cards.'
ArenaRL teaches AI agents by comparing their answers against each other, like a sports tournament, instead of giving each answer a single noisy score.
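One generic way to turn tournament-style comparisons into a ranking (a minimal illustrative sketch, not ArenaRL's actual algorithm) is to run round-robin "matches" between candidate answers and update Elo-style ratings, so each answer's score comes from head-to-head preferences rather than a single absolute grade. The `prefer` judge here is a hypothetical stand-in for whatever comparison signal a training setup uses:

```python
# Illustrative Elo-style ranking over pairwise comparisons
# (hypothetical sketch; not taken from the ArenaRL paper).
from itertools import combinations

def elo_rank(answers, prefer, k=32, base=1000.0):
    """Rank answers via round-robin pairwise comparisons.

    answers: list of candidate answers (hashable labels).
    prefer(a, b): returns whichever of a, b wins the head-to-head
                  comparison (e.g. a judge model's preference).
    """
    rating = {a: base for a in answers}
    for a, b in combinations(answers, 2):
        winner = prefer(a, b)
        loser = b if winner == a else a
        # Expected score of the winner under the Elo model
        exp_w = 1.0 / (1.0 + 10 ** ((rating[loser] - rating[winner]) / 400))
        rating[winner] += k * (1.0 - exp_w)
        rating[loser] -= k * (1.0 - exp_w)
    return sorted(answers, key=lambda a: rating[a], reverse=True)

# Toy judge (an assumption for the demo): prefers the longer answer.
ranked = elo_rank(["ok", "a detailed answer", "short"],
                  prefer=lambda a, b: a if len(a) >= len(b) else b)
print(ranked[0])  # the answer that won the most comparisons
```

Because every rating update is relative, a noisy absolute scorer can be replaced by any consistent pairwise judge, which is the appeal of the tournament framing.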
LLMs can look confident but still change their answers when the surrounding text nudges them, showing that confidence alone isn’t real truthfulness.
Video models can now be told what physical result you want (like “make this ball move left with a strong push”) using Goal Force, instead of just vague text or a final picture.
The paper teaches an AI to act like a careful traveler: it looks at a photo, forms guesses about where it might be, and uses real map tools to check each guess.
This paper builds MFMD-Scen, a big benchmark that tests whether an AI flips its true/false judgment about the same money-related claim when the situation around it changes.
This paper teaches a camera to fix nighttime colors by combining a smart rule-based color trick (SGP-LRD) with a learning-by-trying helper (reinforcement learning).
Re-Align is a new way for AI to make and edit pictures by thinking in clear steps before drawing.
This survey explains how AI judges are changing from single smart readers (LLM-as-a-Judge) into full-on agents that can plan, use tools, remember, and work in teams (Agent-as-a-Judge).
Big reasoning AIs think in many steps, which is slow and costly.
Long-term AI helpers remember past chats, but using all memories can trap them in old ideas (Memory Anchoring).