How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (6)

#reasoning efficiency

On-Policy Self-Distillation for Reasoning Compression

Beginner
Hejian Sang, Yuanda Xu et al. · Mar 5 · arXiv

Reasoning models often talk too much, and those extra words can actually make their answers less accurate.

#on-policy self-distillation #reasoning compression #conciseness instruction

From Scale to Speed: Adaptive Test-Time Scaling for Image Editing

Intermediate
Xiangyan Qu, Zhenlong Yuan et al. · Feb 24 · arXiv

This paper speeds up and improves AI image editing by giving hard edits more attention and easy edits less, just like a smart coach.

#adaptive test-time scaling #image chain-of-thought #image editing

Latent Thoughts Tuning: Bridging Context and Reasoning with Fused Information in Latent Tokens

Intermediate
Weihao Liu, Dehai Min et al. · Feb 10 · arXiv

The paper introduces LT-Tuning, a way for AI models to “think silently” using special hidden tokens instead of writing every step out loud.

#latent tokens #chain-of-thought #context-prediction fusion

MAXS: Meta-Adaptive Exploration with LLM Agents

Intermediate
Jian Zhang, Zhiyuan Wang et al. · Jan 14 · arXiv

MAXS is a new way for AI agents to think a few steps ahead while using tools like search and code, so they make smarter choices.

#LLM agents #tool-augmented reasoning #lookahead

JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

Intermediate
Jiangshan Duo, Hanyu Li et al. · Jan 13 · arXiv

JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.

#RLVR #judge-then-generate #discriminative supervision

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Intermediate
Shih-Yang Liu, Xin Dong et al. · Jan 8 · arXiv

When a model learns from many rewards at once, a popular method called GRPO can accidentally squash different reward mixes into the same learning signal, which confuses training.

#GDPO #GRPO #multi-reward reinforcement learning
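
To make the GDPO summary concrete, here is a minimal toy sketch of how summing several rewards before group normalization can give two different reward mixes the same learning signal. It is not the paper's implementation. The function names, the two example groups, and the `eps` constant are all hypothetical, and reading "decoupled normalization" as "normalize each reward separately, then sum" is an assumption based on the paper's title.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """Standardize values within one group (zero mean, unit population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def summed_then_normalized(rewards):
    """GRPO-style signal (assumed): sum each rollout's rewards, then normalize."""
    totals = [sum(r) for r in rewards]
    return group_normalize(totals)


def normalized_then_summed(rewards):
    """Decoupled signal (assumed): normalize each reward column, then sum."""
    columns = list(zip(*rewards))
    normalized = [group_normalize(col) for col in columns]
    return [sum(vals) for vals in zip(*normalized)]


# Two hypothetical groups of two rollouts, each scored by two binary rewards.
group_a = [(0, 0), (1, 0)]  # second rollout satisfies one reward
group_b = [(0, 0), (1, 1)]  # second rollout satisfies both rewards

# Summing first hides the difference: both groups give roughly [-1, 1].
coupled_a = summed_then_normalized(group_a)
coupled_b = summed_then_normalized(group_b)

# Normalizing per reward keeps it: roughly [-1, 1] versus [-2, 2].
decoupled_a = normalized_then_summed(group_a)
decoupled_b = normalized_then_summed(group_b)

print(coupled_a, coupled_b)
print(decoupled_a, decoupled_b)
```

In this toy setup, the rollout that satisfies both rewards gets a stronger signal than the one that satisfies only one reward, but only when each reward is normalized on its own.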