How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (13)

Tag: #knowledge distillation

Reinforced Attention Learning

Intermediate
Bangzheng Li, Jianmo Ni et al. · Feb 4 · arXiv

This paper teaches AI to pay attention better by training its focus, not just its words.

#Reinforced Attention Learning · #attention policy · #multimodal LLM

Rethinking Selective Knowledge Distillation

Intermediate
Almog Tavor, Itay Ebenspanger et al. · Feb 1 · arXiv

The paper studies how to teach a smaller language model using a bigger one by distilling only the most useful parts of the teacher's signal instead of everything; a generic sketch of the idea follows this card's tags.

#knowledge distillation · #selective distillation · #student entropy
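
A minimal sketch of what selective, token-level distillation can look like, assuming (as the #student entropy tag hints) that tokens are picked by student entropy; the keep ratio, temperature, and selection rule here are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def selective_kd_loss(student_logits, teacher_logits, keep_ratio=0.25, temperature=2.0):
    """student_logits, teacher_logits: [batch, seq_len, vocab_size]."""
    # Per-token student entropy as the (assumed) "usefulness" signal.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    entropy = -(student_log_probs.exp() * student_log_probs).sum(dim=-1)  # [batch, seq_len]

    # Distill only on the highest-entropy tokens instead of on everything.
    k = max(1, int(entropy.size(1) * keep_ratio))
    _, top_idx = entropy.topk(k, dim=1)
    mask = torch.zeros_like(entropy, dtype=torch.bool).scatter_(1, top_idx, True)

    # Standard temperature-scaled KL term, averaged over the selected tokens only.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl_per_token = F.kl_div(student_log_probs, teacher_probs, reduction="none").sum(dim=-1)
    return (kl_per_token * mask).sum() / mask.sum() * temperature ** 2
```

In training this term would be added to the usual cross-entropy loss on ground-truth labels; which tokens count as "useful" is exactly the question the paper rethinks.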

Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

Intermediate
Yingfa Chen, Zhen Leng Thai et al. · Jan 29 · arXiv

This paper shows how to turn a big Transformer model into a faster hybrid model that mixes attention and RNN-style layers, using far less training data (about 2.3B tokens); a generic linear-attention recurrence is sketched after this card's tags.

#hybrid attention · #RNN attention hybrid · #linear attention
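
For context on the "RNN-style" half of such hybrids, here is a generic causal linear-attention recurrence (the textbook technique behind the #linear attention tag), with an ELU+1 feature map as an illustrative assumption; it is not the paper's specific layer.

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v):
    """q, k, v: [batch, seq_len, dim]. Recurrent O(n * d^2) form: the whole
    history is summarized in a fixed-size state, which is what makes the
    layer behave like an RNN at inference time."""
    q, k = F.elu(q) + 1, F.elu(k) + 1           # positive feature map (assumption)
    batch, seq_len, dim = q.shape
    state = q.new_zeros(batch, dim, dim)        # running sum of k_t v_t^T
    norm = q.new_zeros(batch, dim)              # running sum of k_t
    outputs = []
    for t in range(seq_len):
        state = state + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)
        norm = norm + k[:, t]
        numer = torch.einsum("bd,bde->be", q[:, t], state)
        denom = (q[:, t] * norm).sum(dim=-1, keepdim=True).clamp_min(1e-6)
        outputs.append(numer / denom)
    return torch.stack(outputs, dim=1)          # [batch, seq_len, dim]
```

A hybrid stack would interleave blocks like this with ordinary softmax-attention blocks; the card's point is that such a model can be distilled from a pretrained Transformer with relatively little data.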

Grounding and Enhancing Informativeness and Utility in Dataset Distillation

Intermediate
Shaobo Wang, Yantai Yang et al. · Jan 29 · arXiv

This paper tackles dataset distillation by giving a clear, math-backed way to keep only the most useful bits of data, so models can learn well from far fewer images; a generic data-valuation sketch follows this card's tags.

#dataset distillation · #data condensation · #Shapley value
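
The #Shapley value tag points to a standard way of scoring how much each training example contributes. Below is a generic Monte Carlo data-Shapley sketch, not the paper's estimator; `utility` is assumed to be something like the validation accuracy of a model trained on the given subset of example indices.

```python
import random

def data_shapley(num_examples, utility, n_permutations=200):
    """Estimate per-example Shapley values by sampling random permutations and
    crediting each example with its marginal gain in utility when added."""
    values = [0.0] * num_examples
    for _ in range(n_permutations):
        order = random.sample(range(num_examples), num_examples)  # random permutation
        subset, prev_score = [], utility([])
        for idx in order:
            subset.append(idx)
            score = utility(subset)        # e.g. validation accuracy (assumption)
            values[idx] += (score - prev_score) / n_permutations
            prev_score = score
    return values
```

Keeping the highest-valued examples is one simple way to condense a dataset from such scores.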

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Intermediate
Yuming Yang, Mingyoung Lai et al. · Jan 20 · arXiv

The paper asks a simple question: Which step-by-step explanations from a teacher model actually help a student model learn to reason better?

#Rank-Surprisal Ratio · #data-student suitability · #chain-of-thought distillation

Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Intermediate
Xinyu Zhu, Yuzhu Cai et al. · Jan 15 · arXiv

This paper builds an AI agent, ML-Master 2.0, that can work on machine learning projects for a very long time without forgetting what matters.

#Hierarchical Cognitive Caching · #cognitive accumulation · #ultra-long-horizon autonomy

LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

Intermediate
Linquan Wu, Tianxiang Jiang et al. · Jan 15 · arXiv

LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.

#multimodal reasoning · #visual attention · #knowledge distillation

Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning

Intermediate
Shaotian Yan, Kaiyuan Liu et al. · Jan 14 · arXiv

The paper introduces DASD-4B-Thinking, a small (4B) open-source reasoning model that scores like much larger models on hard math, science, and coding tests.

#sequence-level distillation · #divergence-aware sampling · #temperature-scheduled learning

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Intermediate
Mingxin Li, Yanzhao Zhang et al. · Jan 8 · arXiv

This paper builds two models that work as a team, Qwen3-VL-Embedding and Qwen3-VL-Reranker, which understand text, images, visual documents, and videos in one shared space so search works across all of them.

#multimodal retrieval · #unified embedding space · #cross-encoder reranker

TimeBill: Time-Budgeted Inference for Large Language Models

Intermediate
Qi Fan, An Zou et al. · Dec 26 · arXiv

TimeBill is a way to help big AI models finish their answers within a time budget without ruining answer quality; a back-of-the-envelope sketch of the idea follows this card's tags.

#time-budgeted inference · #response length prediction · #execution time estimation
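
A back-of-the-envelope sketch of how the card's three tags could fit together: predict the response length, estimate prefill and per-token decode time, and cap generation so the answer fits the budget. The helper name and the numbers are illustrative assumptions, not TimeBill's actual components.

```python
def token_budget(time_budget_s, predicted_len, prefill_time_s, per_token_time_s):
    """Return a max_new_tokens cap intended to keep decoding inside the time budget."""
    decode_budget_s = max(0.0, time_budget_s - prefill_time_s)
    max_affordable = int(decode_budget_s / per_token_time_s)
    # Never allocate more tokens than the answer is predicted to need.
    return min(predicted_len, max_affordable)

# e.g. a 2 s budget with 0.3 s prefill and 5 ms/token affords 340 tokens,
# but only 180 are reserved if the length predictor expects a 180-token answer.
print(token_budget(2.0, predicted_len=180, prefill_time_s=0.3, per_token_time_s=0.005))
```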

Masking Teacher and Reinforcing Student for Distilling Vision-Language Models

Intermediate
Byung-Kwan Lee, Yu-Chiang Frank Wang et al. · Dec 23 · arXiv

Big vision-language models are super smart but too large to fit on phones and small devices, so this paper distills them into smaller students by masking the teacher and reinforcing the student.

#vision-language models · #knowledge distillation · #masking teacher

Improving Recursive Transformers with Mixture of LoRAs

Intermediate
Mohammadmahdi Nouriborji, Morteza Rohanian et al. · Dec 14 · arXiv

Recursive transformers save memory by reusing the same layer over and over, but that makes them less expressive and hurts accuracy.

#Mixture of LoRAs · #recursive transformers · #parameter sharing