How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (784)


HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Intermediate
Haowei Zhang, Shudong Yang et al. · Jan 21 · arXiv

HERMES is a training-free way to make video-language models understand live, streaming video quickly and accurately.

#HERMES · #KV cache · #hierarchical memory

GutenOCR: A Grounded Vision-Language Front-End for Documents

Intermediate
Hunter Heidenreich, Ben Elliott et al. · Jan 20 · arXiv

GutenOCR turns a general vision-language model into a single, smart OCR front-end that can read, find, and point to text on a page using simple prompts.

#grounded OCR · #vision-language model · #document understanding

Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis

Intermediate
Thanathai Lertpetchpun, Yoonjeong Lee et al. · Jan 20 · arXiv

The paper shows how to control accents in text-to-speech (TTS) by mixing simple, linguistics-based sound-change rules with speaker embeddings.

#text-to-speech · #accent control · #phonological rules

Implicit Neural Representation Facilitates Unified Universal Vision Encoding

Intermediate
Matthew Gwilliam, Xiao Wang et al. · Jan 20 · arXiv

This paper introduces HUVR, a single vision model that can both recognize what’s in an image and reconstruct or generate images from tiny codes.

#Implicit Neural Representation · #Hyper-Networks · #Vision Transformer

VideoMaMa: Mask-Guided Video Matting via Generative Prior

Intermediate
Sangbeom Lim, Seoung Wug Oh et al. · Jan 20 · arXiv

VideoMaMa is a model that turns simple black-and-white object masks into soft, precise cutouts (alpha mattes) for every frame of a video.

#video matting · #alpha matte · #binary segmentation mask

Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis

Intermediate
Hongyuan Chen, Xingyu Chen et al. · Jan 20 · arXiv

Motion 3-to-4 turns a single regular video into a moving 3D object over time (a 4D asset) by first recovering the object’s shape and then figuring out how every part moves.

#4D synthesis · #monocular video · #motion reconstruction

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Intermediate
Said Taghadouini, Adrien Cavaillès et al. · Jan 20 · arXiv

LightOnOCR-2-1B is a single, compact AI model that reads PDF pages and scans and turns them into clean, well-ordered text without using fragile multi-step OCR pipelines.

#LightOnOCR-2-1B · #end-to-end OCR · #vision-language model

OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer

Intermediate
Pengze Zhang, Yanze Wu et al. · Jan 20 · arXiv

OmniTransfer is a single system that learns from a whole reference video, not just one image, so it can copy how things look (identity and style) and how they move (motion, camera, effects).

#spatio-temporal video transfer · #identity transfer · #style transfer

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Intermediate
Yuming Yang, Mingyoung Lai et al. · Jan 20 · arXiv

The paper asks a simple question: which step-by-step explanations from a teacher model actually help a student model learn to reason better?

#Rank-Surprisal Ratio · #data-student suitability · #chain-of-thought distillation

Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

Intermediate
Haocheng Xi, Charlie Ruan et al. · Jan 20 · arXiv

Jet-RL speeds up on-policy reinforcement learning for large language models, where the rollout (text-generation) stage can take more than 70% of training time, by running training and rollout in a unified FP8 precision flow.

#FP8 quantization · #on-policy reinforcement learning · #precision flow

KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning

Intermediate
Egor Cherepanov, Daniil Zelezetsky et al. · Jan 20 · arXiv

KAGE-Bench is a fast, carefully controlled benchmark that tests how well reinforcement learning (RL) agents trained on pixels handle specific visual changes, like new backgrounds or lighting, without changing the actual game rules.

#reinforcement learning · #visual generalization · #KAGE-Env

InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning

Intermediate
Matthew Y. R. Yang, Hao Bai et al. · Jan 20 · arXiv

The paper introduces Intervention Training (InT), a simple way for a language model to find and fix the first wrong step in its own reasoning using a short, targeted correction.

#Intervention Training · #credit assignment · #LLM reasoning