How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (13)


OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Intermediate
Yue Ding, Yiyan Ji et al. · Feb 4 · arXiv

OmniSIFT is a new way to shrink (compress) audio and video tokens so omni-modal language models can think faster without forgetting important details; a toy sketch of the idea follows below.

#Omni-LLM · #token compression · #modality-asymmetric
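A minimal sketch of what modality-asymmetric token compression can look like, assuming a simple L2-norm saliency score and per-modality keep ratios; both are illustrative choices, not OmniSIFT's actual scoring or budgets:

```python
# Toy sketch (not the paper's code): modality-asymmetric token compression.
# Hypothetical setup: video/audio tokens get aggressive top-k selection by a
# saliency score, while text tokens are kept intact.
import numpy as np

def compress_tokens(tokens: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Keep the top-`keep_ratio` fraction of tokens by L2-norm saliency."""
    if keep_ratio >= 1.0:
        return tokens
    scores = np.linalg.norm(tokens, axis=-1)          # (num_tokens,)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(-scores)[:k])           # preserve temporal order
    return tokens[keep]

rng = np.random.default_rng(0)
video = rng.normal(size=(256, 64))   # 256 video tokens, 64-dim
audio = rng.normal(size=(128, 64))   # 128 audio tokens
text  = rng.normal(size=(32, 64))    # 32 text tokens

# Asymmetric budgets: squeeze video hardest, leave text untouched.
compressed = {
    "video": compress_tokens(video, keep_ratio=0.25),
    "audio": compress_tokens(audio, keep_ratio=0.5),
    "text":  compress_tokens(text,  keep_ratio=1.0),
}
print({k: v.shape for k, v in compressed.items()})
# {'video': (64, 64), 'audio': (64, 64), 'text': (32, 64)}
```

The asymmetry reflects the paper's premise that audio and video streams carry far more redundant tokens than text, so they can absorb most of the compression budget.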

3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Intermediate
Zhixue Fang, Xu He et al. · Feb 3 · arXiv

This paper introduces 3DiMo, a new way to control how people move in generated videos while keeping the camera movement flexible and steerable through text.

#3D-aware motion · #implicit motion encoder · #motion tokens

PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

Intermediate
Minh-Quan Le, Gaurav Mittal et al. · Feb 2 · arXiv

This paper shows how to make text-to-video models create clearer, steadier, and more on-topic videos without using any human-labeled ratings; a rough sketch of the reward idea follows below.

#text-to-video · #optimal transport · #annotation-free
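A rough sketch of an optimal-transport-style alignment reward, assuming cosine costs between frame embeddings and prompt-token embeddings and a small Sinkhorn solver; the embedding models and exact reward shaping are stand-ins, not PISCES's recipe:

```python
# Toy sketch: score a generated video by how well its frames transport onto
# the prompt tokens, with no human labels involved.
import numpy as np

def cosine_cost(a, b):
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return 1.0 - a @ b.T                              # (n, m) cost matrix

def sinkhorn_reward(frame_emb, text_emb, eps=0.1, iters=200):
    """Reward = negative entropic-OT cost between frames and prompt tokens."""
    C = cosine_cost(frame_emb, text_emb)
    n, m = C.shape
    K = np.exp(-C / eps)                              # Gibbs kernel
    r, c = np.ones(n) / n, np.ones(m) / m             # uniform marginals
    v = np.ones(m) / m
    for _ in range(iters):                            # Sinkhorn iterations
        u = r / (K @ v + 1e-12)
        v = c / (K.T @ u + 1e-12)
    P = np.diag(u) @ K @ np.diag(v)                   # transport plan
    return -float(np.sum(P * C))                      # higher = better aligned

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 128))                      # prompt-token embeddings
off_topic = rng.normal(size=(16, 128))                # frames unrelated to text
on_topic = off_topic.copy()
on_topic[:8] = text + 0.05 * rng.normal(size=(8, 128))  # frames matching the text
print(sinkhorn_reward(on_topic, text) > sinkhorn_reward(off_topic, text))  # True
```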

Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis

Intermediate
Hongyuan Chen, Xingyu Chen et al. · Jan 20 · arXiv

Motion 3-to-4 turns a single regular video into a moving 3D object over time (a 4D asset) by first getting the object’s shape and then figuring out how every part moves.

#4D synthesis · #monocular video · #motion reconstruction

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Intermediate
Mingxin Li, Yanzhao Zhang et al. · Jan 8 · arXiv

This paper builds two companion models, Qwen3-VL-Embedding and Qwen3-VL-Reranker, that understand text, images, visual documents, and videos in one shared space so search works across all of them; a toy retrieval sketch follows below.

#multimodal retrieval · #unified embedding space · #cross-encoder reranker
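A toy sketch of the standard embed-then-rerank pattern such a model pair fits into; `embed` and `rerank_score` here are hypothetical stand-ins, not the Qwen3-VL API:

```python
# Toy sketch of two-stage multimodal retrieval: a bi-encoder for fast recall
# in one shared embedding space, then a cross-encoder reranker for precision.
import numpy as np

def embed(item: str) -> np.ndarray:
    """Stand-in bi-encoder: maps any modality's item into one shared space."""
    rng_item = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng_item.normal(size=64)
    return v / np.linalg.norm(v)

def rerank_score(query: str, item: str) -> float:
    """Stand-in cross-encoder: scores the (query, item) pair jointly."""
    return float(embed(query) @ embed(item))      # placeholder for a joint model

corpus = ["cat.png", "invoice.pdf", "lecture.mp4", "a poem about cats"]
query = "picture of a cat"

# Stage 1: fast recall with the shared embedding space (bi-encoder).
q = embed(query)
sims = np.array([embed(d) @ q for d in corpus])
shortlist = [corpus[i] for i in np.argsort(-sims)[:3]]

# Stage 2: slower but more precise reranking of the shortlist (cross-encoder).
reranked = sorted(shortlist, key=lambda d: rerank_score(query, d), reverse=True)
print(reranked)
```

The design choice to sketch here is the split itself: the embedding model makes every modality searchable with one cheap dot product, and the reranker only ever sees the short list it returns.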

LTX-2: Efficient Joint Audio-Visual Foundation Model

Intermediate
Yoav HaCohen, Benny Brazowski et al. · Jan 6 · arXiv

LTX-2 is an open-source model that makes video and sound together from a text prompt, so the picture and audio match in time and meaning.

#text-to-video · #text-to-audio · #audiovisual generation

Pretraining Frame Preservation in Autoregressive Video Memory Compression

Intermediate
Lvmin Zhang, Shengqu Cai et al. · Dec 29 · arXiv

The paper teaches a video model to squeeze long video history into a tiny memory while still keeping sharp details in single frames; a toy sketch of the idea follows below.

#autoregressive video generation · #video memory compression · #frame retrieval pretraining
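A toy sketch of the general idea, assuming a fixed number of memory slots, running-average compression of older frames, and a small full-detail window of recent frames; these are illustrative choices, not the paper's architecture:

```python
# Toy sketch: keep recent frames at full detail while folding older frames
# into a small fixed-size memory that the video model conditions on.
import numpy as np

class FrameMemory:
    def __init__(self, recent_window=4, memory_slots=8, dim=64):
        self.recent = []                               # full-detail recent frames
        self.window = recent_window
        self.memory = np.zeros((memory_slots, dim))    # tiny compressed history
        self.counts = np.zeros(memory_slots)

    def add(self, frame_feat: np.ndarray, t: int):
        self.recent.append(frame_feat)
        if len(self.recent) > self.window:
            old = self.recent.pop(0)
            slot = t % len(self.memory)                # naive slot assignment
            # Running average into the slot: lossy compression of old frames.
            self.counts[slot] += 1
            self.memory[slot] += (old - self.memory[slot]) / self.counts[slot]

    def context(self) -> np.ndarray:
        """Model input: compressed history plus sharp recent frames."""
        return np.concatenate([self.memory, np.stack(self.recent)], axis=0)

mem = FrameMemory()
rng = np.random.default_rng(0)
for t in range(100):                                   # a 100-frame video
    mem.add(rng.normal(size=64), t)
print(mem.context().shape)                             # (12, 64): 8 memory + 4 recent
```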

Act2Goal: From World Model To General Goal-conditioned Policy

Intermediate
Pengfei Zhou, Liliang Chen et al. · Dec 29 · arXiv

Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.

#goal-conditioned policy · #visual world model · #multi-scale temporal hashing

CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

Intermediate
Moritz Böhle, Amélie Royer et al. · Dec 22 · arXiv

CASA is a new way to mix images and text inside a language model that keeps speed and memory low while keeping accuracy high; a toy sketch of the idea follows below.

#CASA · #cross-attention · #self-attention
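A toy sketch contrasting a separate cross-attention block with routing image tokens through one self-attention call over the joint sequence; this is a generic single-head illustration of "cross-attention via self-attention", not the exact CASA layer:

```python
# Toy sketch: fuse image tokens through the model's own self-attention
# instead of adding a dedicated cross-attention module.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax(q @ k.T * scale) @ v

rng = np.random.default_rng(0)
d = 64
text = rng.normal(size=(16, d))      # 16 text tokens
image = rng.normal(size=(32, d))     # 32 image tokens

# Option A: a separate cross-attention block, text queries over image keys/values.
cross_out = attention(text, image, image)

# Option B: prepend image tokens to the sequence and reuse one self-attention
# op; the text positions now attend over image and text together.
joint = np.concatenate([image, text], axis=0)
self_out = attention(joint, joint, joint)[len(image):]   # keep the text positions

print(cross_out.shape, self_out.shape)   # (16, 64) (16, 64)
```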

DeContext as Defense: Safe Image Editing in Diffusion Transformers

Intermediate
Linghui Shen, Mingyue Cui et al. · Dec 18 · arXiv

This paper protects your photos from being misused by new AI image editors that can copy your face or style from just one picture.

#Diffusion Transformer · #cross-attention · #in-context image editing

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Intermediate
Yicheng Feng, Wanpeng Zhang et al. · Dec 15 · arXiv

Robots often see the world as flat pictures but must move in a 3D world, which makes accurate actions hard.

#Vision-Language-Action · #3D spatial grounding · #visual-physical alignment

Composing Concepts from Images and Videos via Concept-prompt Binding

Intermediate
Xianghao Kong, Zeyu Zhang et al. · Dec 10 · arXiv

This paper introduces BiCo, a one-shot way to mix ideas from images and videos by tightly tying each visual idea to the exact words in a prompt; a toy sketch of the binding idea follows below.

#BiCo · #concept binding · #token-level composition
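A toy sketch of token-level concept binding, where each extracted visual concept is injected into the embedding of the prompt word it is bound to; the `concept_bank` and the additive binding are illustrative assumptions, not BiCo's actual mechanism:

```python
# Toy sketch: tie visual concepts from reference images/videos to specific
# prompt tokens so the generator uses each concept at the right word.
import numpy as np

rng = np.random.default_rng(0)
d = 64

def text_embed(word: str) -> np.ndarray:
    """Stand-in text embedder (one deterministic vector per word)."""
    return np.random.default_rng(abs(hash(word)) % (2**32)).normal(size=d)

# Hypothetical visual concepts extracted one-shot from reference media.
concept_bank = {
    "corgi": rng.normal(size=d),       # appearance concept from a dog photo
    "surfing": rng.normal(size=d),     # motion concept from a video clip
}

prompt = "a corgi surfing at sunset".split()

# Bind: wherever a prompt token has a matching concept, inject the visual
# concept into that token's embedding.
tokens = []
for word in prompt:
    e = text_embed(word)
    if word in concept_bank:
        e = e + concept_bank[word]     # simple additive binding for illustration
    tokens.append(e)

prompt_embedding = np.stack(tokens)    # (num_tokens, d), fed to the generator
print(prompt_embedding.shape)          # (5, 64)
```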