ERNIE 5.0 is a single giant model that can read and create text, images, video, and audio by predicting the next pieces step by step, like writing a story one line at a time.
WideSeek-R1 teaches a small 4B-parameter language model to act like a well-run team: one leader plans, many helpers work in parallel, and everyone learns together with reinforcement learning.
This paper argues that today's content-generation AIs are great at producing pretty pictures and videos but often miss what people actually want, creating a wide Intent-Execution Gap.
Long texts make language models slow because they must keep and re-check a huge memory called the KV cache for every new word they write.
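To make that concrete, here is a rough back-of-the-envelope sketch, not taken from the paper: it assumes a hypothetical 7B-class transformer (32 layers, 32 key-value heads, head dimension 128, 16-bit values) and shows how the KV cache alone grows linearly with context length, while every new token must also attend over everything already cached.

```python
# Illustrative sketch only: KV-cache size for an assumed 7B-class transformer.
# All dimensions below are assumptions chosen for illustration, not from the paper.

def kv_cache_bytes(seq_len: int,
                   num_layers: int = 32,
                   num_kv_heads: int = 32,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to store cached keys and values for `seq_len` tokens."""
    # Factor of 2 covers keys + values, stored at every layer and every KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len

for tokens in (1_000, 32_000, 128_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> ~{gib:.1f} GiB of KV cache")
```

Under these assumed dimensions the cache is roughly 0.5 GiB at 1K tokens, 15.6 GiB at 32K, and 62.5 GiB at 128K, and each new word has to be checked against all of it, which is why long contexts make generation slow and memory-hungry.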
EgoActor is a vision-language model that turns everyday instructions like 'Go to the door and say hi' into step-by-step, egocentric actions a humanoid robot can actually do.
The paper teaches multimodal large language models (MLLMs) to stop guessing from just text or just images and instead check both together before answering.
Agent-Omit teaches AI agents to skip unneeded thinking and old observations, cutting tokens while keeping accuracy high.
The paper tackles a common problem: people can ask AI to do big, complex tasks, but they can’t always explain exactly what they want or check the results well.
Multimodal Process Reward Models (MPRMs) teach AI to judge each step of a picture-and-text reasoning process, not just the final answer.
Binary right/wrong rewards for training reasoning in large language models are hard to design and often too sparse to learn from.
Robots often learn good hand motions during training but get confused when the scene or the instructions change even slightly at test time.
AgentArk teaches one language model to think like a whole team of models that debate, so it can solve tough problems quickly without running a long, expensive debate at answer time.