Papers (1262)

HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming

Intermediate

Haonan Qiu, Shikun Liu et al. · Dec 24 · arXiv

HiStream speeds up 1080p video generation by eliminating redundant computation across space, time, and denoising steps.

#high-resolution video generation, #diffusion transformer (DiT), #dual-resolution caching

Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Beginner

Li-Zhong Szu-Tu, Ting-Lin Wu et al. · Dec 24 · arXiv

The paper builds YearGuessr, a giant, worldwide photo-and-text dataset of 55,546 buildings with their construction years (1001–2024), GPS, and popularity (page views).

#YearGuessr, #building age estimation, #ordinal regression

Streaming Video Instruction Tuning

Intermediate

Jiaer Xia, Peixian Chen et al. · Dec 24 · arXiv

Streamo is a real-time video assistant that knows when to stay quiet, when to wait, and when to speak—while a video is still playing.

#streaming video LLM, #real-time video understanding, #instruction tuning

C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

Beginner

Jin Qin, Zihan Liao et al. · Dec 24 · arXiv

C2LLM is a new family of code embedding models that helps computers find the right code faster and more accurately.

#code retrieval, #embedding model, #cross-attention pooling

DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Intermediate

Jiawei Liu, Junqiao Li et al. · Dec 24 · arXiv

DreaMontage is a new AI method that makes long, single-shot videos that feel smooth and connected, even when you give it scattered images or short clips in the middle.

#arbitrary frame conditioning, #one-shot video generation, #Diffusion Transformer

Latent Implicit Visual Reasoning

Intermediate

Kelvin Li, Chuyi Shang et al. · Dec 24 · arXiv

Large Multimodal Models (LMMs) are great at reading text and looking at pictures, but they usually do most of their thinking in words, which limits deep visual reasoning.

#Latent Implicit Visual Reasoning, #latent tokens, #bottleneck attention masking

UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement

Intermediate

Tanghui Jia, Dongyu Yan et al. · Dec 24 · arXiv

UltraShape 1.0 is a two-step 3D generator that first makes a simple overall shape and then zooms in to add tiny details.

#3D diffusion, #coarse-to-fine generation, #voxel-based refinement

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Beginner

Zhe Cao, Tao Wang et al. · Dec 24 · arXiv

T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.

#Text-to-Audio-Video generation, #multimodal evaluation, #cross-modal alignment

Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

Beginner

Jinghan Li, Yang Jin et al. · Dec 24 · arXiv

This paper introduces NExT-Vid, a way to teach a video model by asking it to guess the next frame of a video while parts of the past are hidden.

#autoregressive video pretraining, #masked next-frame prediction, #context isolation

Quantile Rendering: Efficiently Embedding High-dimensional Feature on 3D Gaussian Splatting

Intermediate

Yoonwoo Jeong, Cheng Sun et al. · Dec 24 · arXiv

This paper speeds up how 3D scenes handle big, 512‑dimensional features without throwing away important information.

#3D Gaussian Splatting, #Quantile Rendering, #Open-vocabulary segmentation

NVIDIA Nemotron 3: Efficient and Open Intelligence

Intermediate

NVIDIA et al. · Dec 24 · arXiv

Nemotron 3 is a new family of open AI models (Nano, Super, Ultra) built to think better while running faster and cheaper.

#Nemotron 3, #Mixture-of-Experts, #LatentMoE

Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Intermediate

NVIDIA et al. · Dec 23 · arXiv

Nemotron 3 Nano is a new open-source language model that mixes two brain styles (Mamba and Transformer) and adds a team of special experts (MoE) so it thinks better while running much faster.

#Mixture-of-Experts, #Mamba-2, #Transformer
