How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (9)

#instruction tuning

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Intermediate
Qian Chen, Jinlan Fu et al. · Jan 20 · arXiv

FutureOmni is the first benchmark that tests whether multimodal AI models can predict what happens next from both sound and video, not just explain what already happened.

#multimodal LLM #audio-visual reasoning #future forecasting

TranslateGemma Technical Report

Intermediate
Mara Finkelstein, Isaac Caswell et al. · Jan 13 · arXiv

TranslateGemma is a family of open machine translation models fine-tuned from Gemma 3 to translate many languages more accurately.

#machine translation #TranslateGemma #Gemma 3

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

Intermediate
Kai Liu, Jungang Li et al. · Dec 28 · arXiv

JavisGPT is a single AI model that can both understand sounding videos (audio and video together) and create new ones whose sound and picture stay in sync.

#multimodal large language model #audio-video synchronization #SyncFusion

Streaming Video Instruction Tuning

Intermediate
Jiaer Xia, Peixian Chen et al. · Dec 24 · arXiv

Streamo is a real-time video assistant that knows when to stay quiet, when to wait, and when to speak, all while a video is still playing.

#streaming video LLM #real-time video understanding #instruction tuning

OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value

Intermediate
Mengzhang Cai, Xin Gao et al. · Dec 16 · arXiv

OpenDataArena (ODA) is a fair, open platform that measures how valuable different post-training datasets are for large language models by holding everything else in the training and evaluation setup constant.

#OpenDataArena #post-training datasets #data-centric AI

DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry

Intermediate
Zhenyang Cai, Jiaming Zhang et al. · Dec 12 · arXiv

DentalGPT is a specialized AI model that looks at dental images and text together and explains what it sees, much as a junior dentist would.

#DentalGPT #multimodal large language model #dentistry AI

Insight Miner: A Time Series Analysis Dataset for Cross-Domain Alignment with Natural Language

Intermediate
Yunkai Zhang, Yawen Zhang et al. · Dec 12 · arXiv

Time-series data are numbers tracked over time, like the temperature each hour or the traffic each day, and turning them into clear words usually requires experts.

#time series #multimodal model #trend description

LLaDA2.0: Scaling Up Diffusion Language Models to 100B

Intermediate
Tiwei Bie, Maosong Cao et al. · Dec 10 · arXiv

Before this work, most big language models produced text one token at a time (autoregressive), which made generation slow and hard to parallelize.

#diffusion language model #masked diffusion #block diffusion

InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Intermediate
Hongyuan Tao, Bencheng Liao et al. · Dec 9 · arXiv

InfiniteVL is a vision-language model that combines two ideas: local focus through Sliding Window Attention and long-term memory through a linear-attention module called Gated DeltaNet.

#InfiniteVL #linear attention #Gated DeltaNet