Large language models can quietly pick up hidden preferences from training data that looks harmless.
ERNIE 5.0 is a single giant model that can read and create text, images, video, and audio by predicting the next pieces step by step, like writing a story one line at a time.
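The "one line at a time" idea is ordinary autoregressive decoding: at each step the model scores possible next tokens and appends the most likely one. Here is a minimal sketch using a toy bigram table; the vocabulary and probabilities are invented for illustration, and ERNIE 5.0 applies the same loop to text, image, video, and audio tokens rather than words:

```python
# Toy bigram "model": for each current token, a probability table over next tokens.
# The words and probabilities here are made up for illustration only.
probs = {
    "once": {"upon": 0.9, "more": 0.1},
    "upon": {"a": 1.0},
    "a": {"time": 0.8, "hill": 0.2},
}

def generate(start, steps):
    """Greedy autoregressive decoding: repeatedly append the most likely next token."""
    out = [start]
    for _ in range(steps):
        table = probs[out[-1]]
        out.append(max(table, key=table.get))
    return out

print(generate("once", 3))  # ['once', 'upon', 'a', 'time']
```

Real models replace the lookup table with a neural network and often sample instead of always taking the top choice, but the step-by-step structure is the same.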
WideSeek-R1 teaches a small 4B-parameter language model to act like a well-run team: one leader plans, many helpers work in parallel, and everyone learns together with reinforcement learning.
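The leader-plus-parallel-helpers pattern can be sketched with ordinary concurrency primitives. Everything below (the planner, the helper, the merge step) is a hypothetical stand-in: in WideSeek-R1 both roles are played by the 4B model itself, and the division of labor is learned with reinforcement learning rather than hard-coded:

```python
from concurrent.futures import ThreadPoolExecutor

def leader_plan(task):
    # Hypothetical planner: the leader decomposes the task into subtasks.
    return [f"{task}: subtopic {i}" for i in range(1, 4)]

def helper(subtask):
    # Hypothetical worker: each helper handles one subtask independently.
    return f"notes on '{subtask}'"

task = "survey recent wide-search agents"
subtasks = leader_plan(task)
with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
    results = list(pool.map(helper, subtasks))  # helpers run in parallel
report = "\n".join(results)  # the leader merges the helpers' findings
print(len(results))  # 3
```

The point of the sketch is only the shape of the orchestration: one plan fans out into parallel workers whose outputs are gathered back into a single answer.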
The paper finds a strange gap: the model’s internal states almost perfectly signal when it should use a tool, but the text it actually generates often fails to invoke the tool under strict evaluation rules.
Long texts make language models slow because they must keep and re-check a huge memory called the KV cache for every new word they write.
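The slowdown is easy to see in a toy NumPy sketch: the cache stores one key/value pair per generated token, so the attention computed for each new token touches every past entry, and total work grows quadratically with length. Sizes here are tiny and illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension; real models use far larger sizes

def attend(q, K, V):
    """One query attending over all cached keys/values."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
per_step_cost = []
for _ in range(100):  # generate 100 tokens
    q, k, v = rng.normal(size=(3, d))
    K_cache = np.vstack([K_cache, k])  # the cache grows by one row per token
    V_cache = np.vstack([V_cache, v])
    attend(q, K_cache, V_cache)
    per_step_cost.append(len(K_cache))  # attention cost scales with cache size

print(per_step_cost[0], per_step_cost[-1])  # 1 100: each token is pricier than the last
```

Caching avoids recomputing past keys and values, but every new token still has to read the whole cache, which is why long contexts strain both memory and speed.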
EgoActor is a vision-language model that turns everyday instructions like 'Go to the door and say hi' into step-by-step, egocentric actions a humanoid robot can actually do.
The paper teaches multimodal large language models (MLLMs) to stop guessing from just text or just images and instead check both together before answering.
The paper tries several different ways to translate five low-resource Turkic languages, instead of forcing one method to fit all.
Most language models are trained on compressed tokens, which makes training fast but ties the model to a specific tokenizer.
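The lock-in is concrete: two tokenizers split the same text into different token sequences, so embeddings learned for one vocabulary are meaningless under another. A greedy longest-match sketch with two made-up vocabularies shows the mismatch:

```python
def tokenize(text, vocab):
    """Greedy longest-match segmentation against a fixed vocabulary."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in vocab:
                out.append(text[i:j])
                i = j
                break
        else:  # no vocabulary piece matches: fall back to a single character
            out.append(text[i])
            i += 1
    return out

vocab_a = {"un", "happi", "ness"}
vocab_b = {"unhapp", "in", "ess"}
print(tokenize("unhappiness", vocab_a))  # ['un', 'happi', 'ness']
print(tokenize("unhappiness", vocab_b))  # ['unhapp', 'in', 'ess']
```

A model trained on sequences from `vocab_a` has never seen the pieces `vocab_b` produces, which is the coupling the line above describes.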
Agent-Omit teaches AI agents to skip unneeded thinking and old observations, cutting tokens while keeping accuracy high.
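One half of the idea, dropping stale observations from the agent's context, can be sketched directly. The function below is a hand-written illustration; Agent-Omit instead learns when to omit:

```python
def omit_old_observations(messages, keep_last=1):
    """Keep thoughts and actions, but only the `keep_last` newest observations."""
    obs_indices = [i for i, m in enumerate(messages) if m["role"] == "observation"]
    drop = set(obs_indices[:-keep_last]) if keep_last else set(obs_indices)
    return [m for i, m in enumerate(messages) if i not in drop]

# Hypothetical agent history: old tool outputs dominate the token count.
history = [
    {"role": "thought", "text": "search for the file"},
    {"role": "observation", "text": "ls output: 500 lines"},
    {"role": "thought", "text": "open config.yaml"},
    {"role": "observation", "text": "file contents..."},
]
pruned = omit_old_observations(history, keep_last=1)
print(len(pruned))  # 3: the stale 'ls' observation is gone
```

The savings come from the fact that old tool outputs are usually the longest messages in the context but rarely matter once the agent has acted on them.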
The paper tackles a common problem: people can ask AI to do big, complex tasks, but they can’t always explain exactly what they want or check the results well.
Multimodal Process Reward Models (MPRMs) teach AI to judge each step of a picture-and-text reasoning process, not just the final answer.
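Step-level judging differs from outcome-only judging in that it localizes the first faulty step. The checker below is a simple programmatic stand-in for the learned reward model, applied to a made-up arithmetic chain; a real MPRM scores interleaved image-and-text steps with a trained network:

```python
def score_step(step):
    """Stand-in step judge: 1.0 if the stated equation holds, else 0.0."""
    lhs, rhs = step.split("=")
    return 1.0 if eval(lhs) == int(rhs) else 0.0  # eval is safe for this toy input

chain = ["2+3=5", "5*4=20", "20-1=18"]  # the last step contains a slip
step_scores = [score_step(s) for s in chain]
overall = min(step_scores)  # one bad step sinks the whole chain

print(step_scores, overall)  # [1.0, 1.0, 0.0] 0.0
```

An outcome-only judge would just mark the final answer wrong; the per-step scores also say *where* the reasoning broke, which is the signal process reward models provide for training.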