Big idea: use a small, already-trained model to help a bigger model learn good habits early, so the big one trains faster and ends up smarter.
This paper teaches AI to pay attention better by training its focus, not just its words.
The paper studies how to teach a smaller language model using a bigger one by focusing only on the most useful bits instead of everything.
This paper shows how to turn a big Transformer model into a faster hybrid model that mixes attention and RNN layers, and the conversion needs far less training data (about 2.3B tokens).
This paper tackles dataset distillation by giving a clear, math-backed way to keep only the most useful bits of data, so models can learn well from far fewer images.
The paper asks a simple question: Which step-by-step explanations from a teacher model actually help a student model learn to reason better?
This paper builds an AI agent, ML-Master 2.0, that can work on machine learning projects for a very long time without forgetting what matters.
LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.
The paper introduces DASD-4B-Thinking, a small (4B) open-source reasoning model that scores like much larger models on hard math, science, and coding tests.
This paper builds two models that work as a team, Qwen3-VL-Embedding and Qwen3-VL-Reranker, which understand text, images, visual documents, and videos in one shared space so search works across all of them.
TimeBill is a way to help big AI models finish their answers on time without ruining answer quality.
Big vision-language models are super smart but too large to fit on phones and small devices.