STEP3-VL-10B is a small (10-billion-parameter) open multimodal model that sees images and reads text, yet scores on par with much larger models.
Traditional supervised fine-tuning (SFT) trains a model to reproduce a single reference answer, which can cause it to overfit to the exact wording instead of the underlying idea.
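To make "copy one answer" concrete, here is a minimal sketch of the generic SFT objective (standard cross-entropy against a single reference, not any one paper's formulation): every token of the reference wording is treated as the only correct choice, so paraphrases of the same idea are penalized.

```python
# Minimal sketch of the generic SFT objective (not a specific paper's method):
# the model is trained to reproduce one reference answer token by token.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 6
logits = torch.randn(seq_len, vocab_size)              # model's logits per position
reference = torch.randint(0, vocab_size, (seq_len,))   # token ids of the single "gold" answer

# Cross-entropy pushes probability mass toward the reference tokens,
# penalizing different wordings that express the same idea.
sft_loss = F.cross_entropy(logits, reference)
print(sft_loss.item())
```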
The paper introduces Entropy Sentinel, a simple way to monitor how accurate an AI's answers are by reading its “uncertainty heartbeat”, the entropy of its predictions, during generation.
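As a hedged illustration of what reading that heartbeat could look like, the sketch below computes the entropy of the next-token distribution at each decoding step and flags a generation whose average entropy is high. The function name and the threshold are invented for the example; the actual Entropy Sentinel procedure may differ.

```python
# Sketch: per-step entropy of the next-token distribution as an "uncertainty heartbeat".
# Names and threshold are illustrative assumptions, not the paper's exact method.
import torch

def token_entropies(step_logits: torch.Tensor) -> torch.Tensor:
    """step_logits: (num_steps, vocab_size) logits recorded during generation."""
    probs = torch.softmax(step_logits, dim=-1)
    return -(probs * torch.log(probs + 1e-12)).sum(dim=-1)  # entropy per step, in nats

logits_trace = torch.randn(20, 32000)      # pretend logits from 20 decoding steps
heartbeat = token_entropies(logits_trace)
print("mean entropy:", heartbeat.mean().item())
risky = heartbeat.mean() > 5.0             # made-up threshold for flagging an answer
```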
Ministral 3 is a new family of small-but-mighty AI language models (3B, 8B, 14B) that learn from a larger model using a step-by-step tutoring method called Cascade Distillation.
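Cascade Distillation is the paper's own recipe; as a rough point of reference only, the snippet below shows ordinary knowledge distillation, where a small student is trained to match a larger teacher's softened token distribution. It is not the paper's method, just the standard building block it extends.

```python
# Ordinary knowledge distillation (standard KL-to-teacher loss), shown only as
# background; this is NOT the paper's Cascade Distillation procedure.
import torch
import torch.nn.functional as F

temperature = 2.0
teacher_logits = torch.randn(4, 1000)                    # (batch, vocab) from the large teacher
student_logits = torch.randn(4, 1000, requires_grad=True)

teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

# KL divergence pulls the student's predictions toward the teacher's.
kd_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
kd_loss.backward()
```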
Giving large language models a few good examples and step-by-step instructions can make them much better at spotting feelings in text.
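Here is an illustrative few-shot prompt with step-by-step instructions for emotion labeling. The examples, labels, and wording are invented for the sketch; the paper's actual prompts may look different.

```python
# Invented few-shot prompt with step-by-step instructions for emotion labeling.
FEW_SHOT_PROMPT = """You label the main emotion in a sentence.
Think step by step: (1) note emotional words, (2) weigh the overall tone,
(3) answer with one label: joy, sadness, anger, fear, or neutral.

Sentence: "I finally got the job offer!"
Reasoning: "finally" and "job offer" signal excitement and relief.
Label: joy

Sentence: "They cancelled the trip again without telling me."
Reasoning: "again" and "without telling me" suggest frustration.
Label: anger

Sentence: "{text}"
Reasoning:"""

print(FEW_SHOT_PROMPT.format(text="I don't know if I can handle another week like this."))
```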
The paper introduces Trainee-Bench, a new way to test AI agents that feels like a real first day at work, with tasks arriving over time, hidden clues, and changing priorities.
This paper studies how AI agents that use tools express how sure they are, and finds a split: some tools make them overconfident, while others help them stay honest about their uncertainty.
RealMem is a new benchmark that tests how well AI assistants remember and manage long, ongoing projects across many conversations.
MemGovern teaches code agents to learn from past human fixes on GitHub by turning messy discussions into clean, reusable 'experience cards.'
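For a sense of what an 'experience card' might contain, here is a hypothetical data structure distilled from a GitHub issue thread. All field names and contents are assumptions for illustration; the paper defines its own schema.

```python
# Hypothetical shape of an "experience card"; field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ExperienceCard:
    problem: str                    # short description of the bug or failure
    root_cause: str                 # what the humans concluded was wrong
    fix_summary: str                # how the fix was applied
    applicable_when: list[str] = field(default_factory=list)  # cues for reuse

card = ExperienceCard(
    problem="CI flakes on Windows path handling",
    root_cause="hard-coded '/' separators in test fixtures",
    fix_summary="switch to os.path.join / pathlib in fixture setup",
    applicable_when=["path-related test failures", "Windows-only CI errors"],
)
print(card.problem, "->", card.fix_summary)
```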
ArenaRL teaches AI agents by comparing their answers against each other, like a sports tournament, instead of giving each answer a single noisy score.
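A toy sketch of the tournament idea: compare each sampled answer against every other one with a pairwise judge and use its win rate as a relative reward. The judge here is a placeholder, and this only illustrates the comparison scheme, not ArenaRL's actual training algorithm.

```python
# Toy tournament-style scoring: each answer's reward is its win rate against the others.
# The pairwise judge is a placeholder; ArenaRL's real procedure is the paper's own.
from itertools import combinations

def judge(answer_a: str, answer_b: str) -> int:
    """Stand-in pairwise judge: prefer the longer answer (placeholder logic)."""
    return 0 if len(answer_a) >= len(answer_b) else 1

def tournament_rewards(answers: list[str]) -> list[float]:
    wins = [0] * len(answers)
    for i, j in combinations(range(len(answers)), 2):
        winner = i if judge(answers[i], answers[j]) == 0 else j
        wins[winner] += 1
    n_matches = len(answers) - 1
    return [w / n_matches for w in wins]    # win rate as a relative reward

answers = ["short reply", "a somewhat longer reply", "the most detailed reply of all"]
print(tournament_rewards(answers))          # e.g. [0.0, 0.5, 1.0]
```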
LLMs can look confident yet still change their answers when the surrounding text nudges them, showing that confidence alone is not a reliable sign of truthfulness.
Video models can now be told what physical result you want (like “make this ball move left with a strong push”) using Goal Force, instead of just vague text or a final picture.