ERNIE 5.0 is a single giant model that can read and create text, images, video, and audio by predicting the next piece step by step, like writing a story one line at a time.
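A minimal sketch of that "predict the next piece" idea, assuming all modalities share one discrete token stream; the toy vocabulary, random stand-in model, and sampling loop are illustrative, not ERNIE 5.0's actual architecture.

```python
import random

# Toy unified vocabulary: in a unified multimodal model, text, image, video,
# and audio are all mapped to discrete tokens in one shared stream.
VOCAB = ["<txt:hello>", "<txt:world>", "<img:patch_7>", "<aud:frame_3>", "<eos>"]

def toy_next_token(context):
    """Stand-in for the model: pick the next token given everything so far."""
    # A real model would score every token with a neural network;
    # random choice is enough to show the shape of the loop.
    return random.choice(VOCAB)

def generate(prompt_tokens, max_new=10):
    """Autoregressive generation: one token at a time, each conditioned on all previous ones."""
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        nxt = toy_next_token(tokens)
        tokens.append(nxt)
        if nxt == "<eos>":
            break
    return tokens

print(generate(["<txt:hello>"]))
```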
WideSeek-R1 teaches a small 4B-parameter language model to act like a well-run team: one leader plans, many helpers work in parallel, and everyone learns together with reinforcement learning.
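A rough sketch of the plan-then-fan-out team structure described for WideSeek-R1; the `leader_plan`, `helper_solve`, and `leader_merge` functions are hypothetical placeholders for the paper's RL-trained 4B models.

```python
from concurrent.futures import ThreadPoolExecutor

def leader_plan(task):
    """Leader: break the task into independent subtasks."""
    return [f"{task} :: part {i}" for i in range(4)]

def helper_solve(subtask):
    """Helper: work on one subtask (a real helper would call the LLM)."""
    return f"result for ({subtask})"

def leader_merge(results):
    """Leader: combine the helpers' partial results into one answer."""
    return " | ".join(results)

def run_team(task):
    subtasks = leader_plan(task)
    # Helpers work in parallel, mirroring the one-leader / many-helpers setup.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        results = list(pool.map(helper_solve, subtasks))
    return leader_merge(results)

print(run_team("summarize a long report"))
```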
Long texts make language models slow because they must keep and re-check a huge memory called the KV cache for every new word they write.
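A small sketch of why the KV cache gets expensive: every generated token adds one (key, value) pair, and attention for each new token must touch every cached pair. The toy projections and sizes below are illustrative assumptions, not any particular model's implementation.

```python
import numpy as np

D = 8  # toy hidden size

def kv_for_token(token_embedding):
    """Compute the key/value vectors the model must remember for one token."""
    # In a real transformer these are learned projections; identity suffices here.
    return token_embedding, token_embedding

def generate_with_cache(prompt_len, new_tokens):
    """Each new token appends one (key, value) pair; the cache only grows."""
    cache_k, cache_v = [], []
    # Prefill: cache entries for every prompt token.
    for _ in range(prompt_len):
        k, v = kv_for_token(np.random.randn(D))
        cache_k.append(k); cache_v.append(v)
    for _ in range(new_tokens):
        # Attention for the new token must re-check every cached entry,
        # so each step gets slower and hungrier for memory as the text grows.
        q = np.random.randn(D)
        scores = np.array([q @ k for k in cache_k])
        weights = np.exp(scores - scores.max()); weights /= weights.sum()
        _context = sum(w * v for w, v in zip(weights, cache_v))
        k, v = kv_for_token(q)
        cache_k.append(k); cache_v.append(v)
    return len(cache_k)

print("cache entries after generation:", generate_with_cache(prompt_len=1000, new_tokens=50))
```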
EgoActor is a vision-language model that turns everyday instructions like 'Go to the door and say hi' into step-by-step, egocentric actions a humanoid robot can actually do.
The paper teaches multimodal large language models (MLLMs) to stop guessing from just text or just images and instead check both together before answering.
Agent-Omit teaches AI agents to skip unneeded thinking and old observations, cutting tokens while keeping accuracy high.
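A rough illustration of the "drop stale context" idea: keep only the newest environment observations in the prompt the agent re-reads each step. The window size, message format, and `build_prompt` helper are assumptions for illustration, not Agent-Omit's actual method.

```python
def build_prompt(task, history, keep_last=2):
    """Keep all past actions but only the most recent observations."""
    kept = []
    obs_seen = 0
    for entry in reversed(history):
        if entry["role"] == "observation":
            obs_seen += 1
            if obs_seen > keep_last:
                continue  # old observation: omit it to save tokens
        kept.append(entry)
    kept.reverse()
    lines = [f"Task: {task}"] + [f'{e["role"]}: {e["text"]}' for e in kept]
    return "\n".join(lines)

history = [
    {"role": "action", "text": "open file A"},
    {"role": "observation", "text": "file A: 500 lines of logs"},
    {"role": "action", "text": "open file B"},
    {"role": "observation", "text": "file B: config values"},
    {"role": "action", "text": "grep for ERROR"},
    {"role": "observation", "text": "3 matching lines"},
]
print(build_prompt("find the failing setting", history))
```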
The paper tackles a common problem: people can ask AI to do big, complex tasks, but they can’t always explain exactly what they want or check whether the results are right.
Multimodal Process Reward Models (MPRMs) teach AI to judge each step of a picture-and-text reasoning process, not just the final answer.
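A minimal sketch of step-level judging, assuming a chain is only as good as its weakest step; the `score_step` rule and min-aggregation are made-up placeholders, since a real MPRM is a trained model that also looks at the image.

```python
def score_step(step_text):
    """Placeholder step judge; a trained MPRM would score each step from text and image."""
    return 0.2 if "guess" in step_text else 0.9

def score_chain(steps):
    step_scores = [score_step(s) for s in steps]
    # Aggregate with min(): one bad step should sink the whole chain.
    return min(step_scores), step_scores

chains = [
    ["read the chart axes", "compare the two bars", "answer: B"],
    ["guess from the question alone", "answer: A"],
]
best = max(chains, key=lambda c: score_chain(c)[0])
print("preferred chain:", best)
```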
Robots often learn good hand motions during training but get confused when the scene or the instructions change at test time, even a little bit.
AgentArk teaches one language model to think like a whole team of models that debate, so it can solve tough problems quickly without running a long, expensive debate at answer time.
Parallel-Probe is a simple add-on that lets many AI “thought paths” think at once but stop early when they already agree.
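A toy sketch of that stop-when-they-agree idea: several paths advance in parallel, and generation halts as soon as enough of them have committed to the same answer. The commit probability, agreement threshold, and answer set are illustrative choices, not Parallel-Probe's actual mechanism.

```python
import random
from collections import Counter

def step_path(state):
    """Advance one reasoning path by a step; it may commit to an answer."""
    if state["answer"] is None and random.random() < 0.3:
        state["answer"] = random.choice(["42", "42", "42", "7"])  # biased toward 42
    return state

def solve_with_early_stop(n_paths=8, max_steps=20, agree_frac=0.6):
    paths = [{"answer": None} for _ in range(n_paths)]
    answers = []
    for step in range(max_steps):
        paths = [step_path(p) for p in paths]
        answers = [p["answer"] for p in paths if p["answer"] is not None]
        if answers:
            top, count = Counter(answers).most_common(1)[0]
            # Stop as soon as enough paths already agree: no need to keep thinking.
            if count >= agree_frac * n_paths:
                return top, step + 1
    return (Counter(answers).most_common(1)[0][0] if answers else None), max_steps

answer, steps_used = solve_with_early_stop()
print(f"agreed on {answer} after {steps_used} steps")
```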
AutoFigure is an AI system that reads long scientific texts and then thinks, plans, and draws clear, good-looking figures—like a careful student who makes a neat, accurate poster from a long chapter.