This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.
Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.
TSRBench is a giant test that checks if AI models can understand and reason about data that changes over time, like heartbeats, stock prices, and weather.
The paper finds almost 300 accepted NLP papers (mostly from 2025) that include at least one fake or non-existent reference, a mistake the authors call a HalluCitation.
LLM agents are usually trained in a few worlds but asked to work in many different, unseen worlds, which often hurts their performance.
VIBEVOICE-ASR is a single-pass system that listens to up to 60 minutes of audio at once and outputs who spoke, when they spoke, and what they said in one stream.
Typhoon-S is a simple, open recipe that turns a basic language model into a helpful assistant and then teaches it important local skills, all on a small budget.
Transformers slow down on very long inputs because standard attention looks at every token pair, which is expensive.
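The quadratic cost mentioned above can be seen directly in the shape of the attention score matrix. A minimal sketch (not from the paper; the function and sizes here are illustrative assumptions): for n tokens, standard attention computes one score per token pair, so the score matrix has n * n entries and doubling the input length quadruples the work.

```python
# Sketch of why standard attention is quadratic in sequence length:
# the score matrix Q @ K^T holds one entry per token pair.
import numpy as np

def attention_scores(n_tokens: int, d: int = 64) -> np.ndarray:
    """Random queries/keys of length n_tokens; returns pairwise scores."""
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((n_tokens, d))
    K = rng.standard_normal((n_tokens, d))
    return Q @ K.T  # shape (n_tokens, n_tokens): one score per pair

small = attention_scores(512)
large = attention_scores(1024)
print(small.size)  # 262144 pairwise scores
print(large.size)  # 1048576 -- 2x the tokens, 4x the scores
```

This is why long-context methods try to avoid scoring every pair, e.g. by attending only to a sparse or local subset of tokens.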
LongCat-Flash-Thinking-2601 is a huge 560-billion-parameter Mixture-of-Experts model built to act like a careful helper that can use tools, browse, code, and solve multi-step tasks.
DSGym is a unified 'gym' where AI data science agents are tested and trained by actually running code on real datasets, not just chatting about them.
IVRA is a simple, training-free add-on that helps robot brains preserve the 2D layout of the images they see while following language instructions.
This paper shows that giving an AI a safe, tiny virtual computer (a sandbox) lets it solve many kinds of problems better, not just coding ones.
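The core idea of a sandbox can be sketched in a few lines. This is an illustrative assumption, not the paper's actual system: run model-generated code in a separate child interpreter with a timeout, so a crash or infinite loop never takes down the agent itself.

```python
# Minimal sandbox sketch (hypothetical helper, not the paper's implementation):
# execute untrusted code in a child Python process and capture its output.
import subprocess
import sys

def run_in_sandbox(code: str, timeout_s: float = 5.0) -> str:
    """Run `code` in a fresh interpreter; return stdout or an error message."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "error: timed out"
    if result.returncode != 0:
        return f"error: {result.stderr.strip()}"
    return result.stdout.strip()

print(run_in_sandbox("print(2 + 2)"))  # -> 4
```

A real sandbox adds stronger isolation (no network access, filesystem and memory limits); the subprocess-with-timeout pattern is just the smallest version of the idea.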