CLI-Gym is a new way to create lots of realistic computer-fixing tasks for AI by safely breaking and then repairing software environments inside containers.
FeatureBench is a new benchmark that tests AI coding agents on building real software features, not just fixing small bugs.
This paper builds a new audio tokenizer, called MOSS-Audio-Tokenizer, that turns sound into tiny tokens the way text tokenizers turn sentences into words.
Most image search systems judge each photo by itself, which fails when clues are split across many photos taken over time.
VESPO is a new, stable way to train language models with reinforcement learning even when training data comes from older or mismatched policies.
Decoder-only language models can be great at making user profiles (embeddings), but the attention mask, which controls what parts of the sequence the model can look at, changes how good those profiles are.
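A minimal sketch of the idea (illustrative, not the paper's actual setup): a causal mask lets each position see only earlier tokens, while a bidirectional mask lets every position see the whole sequence, which changes how much context flows into a pooled embedding.

```python
import numpy as np

# How much of the sequence each position can attend to under two masks.
seq_len = 4

# Causal mask: position i attends only to positions 0..i
# (the default for decoder-only language models).
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Bidirectional mask: every position attends to the whole sequence,
# so a pooled embedding can mix information from past and future items.
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

print(causal.sum(axis=1))         # [1 2 3 4] — early tokens see little context
print(bidirectional.sum(axis=1))  # [4 4 4 4] — every token sees everything
```

Under the causal mask, early positions are embedded with almost no context, which is one intuition for why the masking choice matters.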
Training big language models with reinforcement learning can wobble because the per-token importance-sampling (IS) ratios swing wildly.
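A hedged sketch of the instability (toy numbers, not from the paper): the per-token IS ratio is the new policy's probability of a token divided by the old policy's, and a single rare token can blow the ratio up; the standard PPO-style clip bounds how far any one token can swing the update.

```python
import numpy as np

# Per-token importance-sampling ratios and a PPO-style clip.
logp_new = np.array([-0.5, -0.1, -1.2, -0.3])  # log-probs under current policy
logp_old = np.array([-0.6, -5.0, -1.0, -0.4])  # log-probs under sampling policy

ratios = np.exp(logp_new - logp_old)           # one IS ratio per token
# ratios[1] = exp(4.9) ~ 134: one rare token dominates the whole gradient.

eps = 0.2
clipped = np.clip(ratios, 1 - eps, 1 + eps)    # bounded trust region
print(ratios.round(2))
print(clipped.round(2))
```

The clip is the textbook remedy; methods in this space differ mainly in how they bound or smooth these ratios.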
Step 3.5 Flash is a huge but efficient AI that keeps 196 billion total parameters but only wakes up about 11 billion per token, so it stays smart while running fast.
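Back-of-envelope arithmetic plus a toy router (hypothetical, not the model's actual routing) shows why this is cheap: in a sparse Mixture-of-Experts model, only the top-k experts run per token, so compute tracks the active parameters, not the total.

```python
import numpy as np

# Fraction of weights that actually run per token.
total_params = 196e9
active_params = 11e9
fraction_active = active_params / total_params
print(f"{fraction_active:.1%} of weights used per token")  # ~5.6%

# Toy top-k gating: pick the k highest-scoring experts for this token.
rng = np.random.default_rng(0)
n_experts, k = 8, 2
gate_logits = rng.normal(size=n_experts)       # router scores for one token
topk = np.argsort(gate_logits)[-k:]            # indices of the chosen experts
print(f"experts used: {sorted(topk.tolist())} ({k}/{n_experts} active)")
```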
Pictures can hide deeper meanings, like a wilted plant meaning someone feels burned out; most AI models miss these hints.
Long texts overwhelm many language models, which forget important bits and slow down as the context grows.
The paper fixes a common mistake in training language models for multi-part tasks: giving the same reward signal to every token, even when different text parts aim at different goals.
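A tiny illustration of the mistake and one possible fix (the segment labels and reward values here are invented, not the paper's scheme): broadcasting one scalar reward to every token blurs credit, while a segment-aware assignment gives each span of the output its own signal.

```python
# Different parts of one generated response serve different goals.
tokens   = ["<think>", "step", "</think>", "answer"]
segments = ["reasoning", "reasoning", "reasoning", "final"]
rewards  = {"reasoning": 0.2, "final": 1.0}   # hypothetical per-goal rewards

# The common mistake: one blended scalar for every token.
uniform = [0.6] * len(tokens)

# Segment-aware assignment: each token gets its own segment's reward.
per_segment = [rewards[s] for s in segments]
print(per_segment)  # [0.2, 0.2, 0.2, 1.0]
```

With the uniform signal, tokens that helped the final answer and tokens that did not are rewarded identically, which is exactly the credit-assignment problem the paper targets.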
The paper introduces LT-Tuning, a way for AI models to “think silently” using special hidden tokens instead of writing every step out loud.