How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (943)


C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

Beginner
Jin Qin, Zihan Liao et al. · Dec 24 · arXiv

C2LLM is a new family of code embedding models that helps computers find the right code faster and more accurately.

#code retrieval #embedding model #cross-attention pooling

DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Intermediate
Jiawei Liu, Junqiao Li et al. · Dec 24 · arXiv

DreaMontage is a new AI method that makes long, single-shot videos that feel smooth and connected, even when you give it scattered images or short clips in the middle.

#arbitrary frame conditioning #one-shot video generation #Diffusion Transformer

Latent Implicit Visual Reasoning

Intermediate
Kelvin Li, Chuyi Shang et al. · Dec 24 · arXiv

Large Multimodal Models (LMMs) are great at reading text and looking at pictures, but they usually do most of their thinking in words, which limits deep visual reasoning.

#Latent Implicit Visual Reasoning #latent tokens #bottleneck attention masking

UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement

Intermediate
Tanghui Jia, Dongyu Yan et al. · Dec 24 · arXiv

UltraShape 1.0 is a two-step 3D generator that first makes a simple overall shape and then zooms in to add tiny details.

#3D diffusion #coarse-to-fine generation #voxel-based refinement

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Beginner
Zhe Cao, Tao Wang et al. · Dec 24 · arXiv

T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.

#Text-to-Audio-Video generation #multimodal evaluation #cross-modal alignment

Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

Beginner
Jinghan Li, Yang Jin et al. · Dec 24 · arXiv

This paper introduces NExT-Vid, a way to teach a video model by asking it to guess the next frame of a video while parts of the past are hidden.

#autoregressive video pretraining #masked next-frame prediction #context isolation

Quantile Rendering: Efficiently Embedding High-dimensional Feature on 3D Gaussian Splatting

Intermediate
Yoonwoo Jeong, Cheng Sun et al. · Dec 24 · arXiv

This paper speeds up how 3D scenes handle big, 512‑dimensional features without throwing away important information.

#3D Gaussian Splatting #Quantile Rendering #Open-vocabulary segmentation

NVIDIA Nemotron 3: Efficient and Open Intelligence

Intermediate
NVIDIA et al. · Dec 24 · arXiv

Nemotron 3 is a new family of open AI models (Nano, Super, Ultra) built to think better while running faster and cheaper.

#Nemotron 3 #Mixture-of-Experts #LatentMoE

Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Intermediate
NVIDIA et al. · Dec 23 · arXiv

Nemotron 3 Nano is a new open language model that mixes two brain styles (Mamba and Transformer) and adds a team of special experts (MoE) so it thinks better while running much faster.

#Mixture-of-Experts #Mamba-2 #Transformer

TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Beginner
Gül Sena Altıntaş, Malikeh Ehghaghi et al. · Dec 23 · arXiv

TokSuite is a science lab for tokenizers: it trains 14 language models that are identical in every way except for how they split text into tokens.

#tokenization #tokenizer robustness #Byte Pair Encoding (BPE)

SemanticGen: Video Generation in Semantic Space

Intermediate
Jianhong Bai, Xiaoshi Wu et al. · Dec 23 · arXiv

SemanticGen is a new way to make videos that starts by planning in a small, high-level 'idea space' (semantic space) and then adds the tiny visual details later.

#Video generation #Diffusion model #Semantic representation

LongVideoAgent: Multi-Agent Reasoning with Long Videos

Intermediate
Runtao Liu, Ziyi Liu et al. · Dec 23 · arXiv

LongVideoAgent is a team of three AIs that work together to answer questions about hour‑long TV episodes without missing small details.

#long video question answering #multi-agent reasoning #temporal grounding