🎓How I Study AIHISA
📖Read
📄Papers📰Blogs🎬Courses
💡Learn
🛤️Paths📚Topics💡Concepts🎴Shorts
🎯Practice
🧩Problems🎯Prompts🧠Review
SearchSettings
How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers791

AllBeginnerIntermediateAdvanced
All SourcesarXiv

MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

Intermediate
Hanzhang Zhou, Xu Zhang et al.Dec 26arXiv

MAI-UI is a family of AI agents that can see, understand, and control phone and computer screens using plain language.

#GUI agent#GUI grounding#mobile navigation

SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

Intermediate
Shaofei Cai, Yulei Qin et al.Dec 26arXiv

SmartSnap teaches an agent not only to finish a phone task but also to prove it with a few perfect snapshots it picks itself.

#Self-verifying agents#Evidence curation#3C principles

SWE-RM: Execution-free Feedback For Software Engineering Agents

Intermediate
KaShun Shum, Binyuan Hui et al.Dec 26arXiv

Coding agents used to fix software rely on feedback; unit tests give only pass/fail signals that are often noisy or missing.

#execution-free feedback#reward model#software engineering agents

TimeBill: Time-Budgeted Inference for Large Language Models

Intermediate
Qi Fan, An Zou et al.Dec 26arXiv

TimeBill is a way to help big AI models finish their answers on time without ruining answer quality.

#time-budgeted inference#response length prediction#execution time estimation

Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models

Intermediate
Mengqi He, Xinyu Tian et al.Dec 26arXiv

The paper shows that when vision-language models write captions, only a small set of uncertain words (about 20%) act like forks that steer the whole sentence.

#vision-language models#autoregressive generation#entropy

Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation

Intermediate
Steven Xiao, Xindi Zhang et al.Dec 25arXiv

This paper introduces Knot Forcing, a way to make talking-head videos that look great while being generated live, frame by frame.

#Knot Forcing#autoregressive video diffusion#temporal knot

An Information Theoretic Perspective on Agentic System Design

Intermediate
Shizhe He, Avanika Narayan et al.Dec 25arXiv

The paper shows that many AI systems work best when a small 'compressor' model first shrinks long text into a short, info-packed summary and a bigger 'predictor' model then reasons over that summary.

#agentic systems#compressor-predictor#mutual information

UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture

Intermediate
Shuo Cao, Jiayang Li et al.Dec 25arXiv

This paper teaches AI to notice not just what is in a picture, but how the picture looks and feels to people.

#perceptual image understanding#image aesthetics assessment (IAA)#image quality assessment (IQA)

HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming

Intermediate
Haonan Qiu, Shikun Liu et al.Dec 24arXiv

HiStream makes 1080p video generation much faster by removing repeated work across space, time, and steps.

#high-resolution video generation#diffusion transformer (DiT)#dual-resolution caching

Streaming Video Instruction Tuning

Intermediate
Jiaer Xia, Peixian Chen et al.Dec 24arXiv

Streamo is a real-time video assistant that knows when to stay quiet, when to wait, and when to speak—while a video is still playing.

#streaming video LLM#real-time video understanding#instruction tuning

DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Intermediate
Jiawei Liu, Junqiao Li et al.Dec 24arXiv

DreaMontage is a new AI method that makes long, single-shot videos that feel smooth and connected, even when you give it scattered images or short clips in the middle.

#arbitrary frame conditioning#one-shot video generation#Diffusion Transformer

Latent Implicit Visual Reasoning

Intermediate
Kelvin Li, Chuyi Shang et al.Dec 24arXiv

Large Multimodal Models (LMMs) are great at reading text and looking at pictures, but they usually do most of their thinking in words, which limits deep visual reasoning.

#Latent Implicit Visual Reasoning#latent tokens#bottleneck attention masking
4445464748