Papers (200)

Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Li-Zhong Szu-Tu, Ting-Lin Wu et al. | Dec 24 | arXiv

The paper builds YearGuessr, a worldwide photo-and-text dataset of 55,546 buildings, each annotated with its construction year (1001–2024), GPS coordinates, and popularity (page views).

#YearGuessr #building age estimation #ordinal regression

C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

Beginner

Jin Qin, Zihan Liao et al. | Dec 24 | arXiv

C2LLM is a new family of code embedding models that helps computers find the right code faster and more accurately.

#code retrieval #embedding model #cross-attention pooling

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Beginner

Zhe Cao, Tao Wang et al. | Dec 24 | arXiv

T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.

#Text-to-Audio-Video generation #multimodal evaluation #cross-modal alignment

Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

Beginner

Jinghan Li, Yang Jin et al. | Dec 24 | arXiv

This paper introduces NExT-Vid, a way to teach a video model by asking it to guess the next frame of a video while parts of the past are hidden.

#autoregressive video pretraining #masked next-frame prediction #context isolation

TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Beginner

Gül Sena Altıntaş, Malikeh Ehghaghi et al. | Dec 23 | arXiv

TokSuite is a science lab for tokenizers: it trains 14 language models that are identical in every way except for how they split text into tokens.

#tokenization #tokenizer robustness #Byte Pair Encoding (BPE)

WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion

Beginner

Hanyang Kong, Xingyi Yang et al. | Dec 22 | arXiv

WorldWarp is a new method that turns a single photo plus a planned camera path into a long, steady, 3D-consistent video.

#Novel View Synthesis #3D Gaussian Splatting #Spatio-Temporal Diffusion

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Beginner

Yuqiao Tan, Minzheng Wang et al. | Dec 22 | arXiv

Large language models (LLMs) don’t act as a single brain; inside, each layer and module quietly makes its own mini-decisions called internal policies.

#Bottom-up Policy Optimization #internal layer policy #internal modular policy

Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface

Beginner

Yujie Zhao, Hongwei Fan et al. | Dec 22 | arXiv

Robots learn better when they see many examples, but collecting lots of real videos is slow and expensive.

#robotic demonstration generation #depth-controlled video generation #metric-scale 3D reconstruction

MemEvolve: Meta-Evolution of Agent Memory Systems

Beginner

Guibin Zhang, Haotian Ren et al. | Dec 21 | arXiv

MemEvolve teaches AI agents not only to remember past experiences but also to improve the way they remember, like a student who upgrades their study habits over time.

#LLM agents #agent memory #meta-evolution

Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Beginner

Shilong Zhang, He Zhang et al. | Dec 19 | arXiv

This paper shows that strong image-understanding features alone are not enough to generate high-quality images; the encoder also needs to preserve pixel-level detail.

#Pixel–Semantic VAE #Semantic Regularization #Off-Manifold Generation

Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding

Beginner

Jiaqi Tang, Jianmin Chen et al. | Dec 19 | arXiv

Robust-R1 teaches vision-language models to notice how a picture is damaged, think through what that damage hides, and then answer as if the picture were clear.

#Robust-R1 #degradation-aware reasoning #multimodal large language models

Next-Embedding Prediction Makes Strong Vision Learners

Beginner

Sihan Xu, Ziqiao Ma et al. | Dec 18 | arXiv

This paper introduces NEPA, a simple way to train vision models by having them predict the embedding of the next patch in an image's patch sequence, much as language models predict the next word.

#self-supervised learning #vision transformer #autoregression
