Papers (943)

Robust and Calibrated Detection of Authentic Multimedia Content

Sarim Hashmi, Abdelrahman Elsayed et al. · Dec 17 · arXiv

Deepfakes are getting so good that simple yes/no detectors are failing, especially when attackers add tiny, invisible changes.

#Authenticity Index #calibrated resynthesis #reconstruction-free inversion

DEER: Draft with Diffusion, Verify with Autoregressive Models

Intermediate

Zicong Cheng, Guo-Wei Yang et al. · Dec 17 · arXiv

DEER is a new way to speed up big language models by letting a diffusion model draft many tokens at once and an autoregressive model double-check them.

#DEER #speculative decoding #diffusion LLM

Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets

Intermediate

Jialong Zuo, Haoyou Deng et al. · Dec 17 · arXiv

This paper checks if a popular text-to-image model called Nano Banana Pro can fix messy photos without any extra training.

#low-level vision #zero-shot restoration #generative models

Puzzle Curriculum GRPO for Vision-Centric Reasoning

Intermediate

Ahmadreza Jeddi, Hakki Can Karaimer et al. · Dec 16 · arXiv

This paper teaches vision-language models to reason about pictures using puzzles instead of expensive human labels.

#vision-language models #reinforcement learning #group-relative policy optimization

HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering

Beginner

Dan Ben-Ami, Gabriele Serussi et al. · Dec 16 · arXiv

HERBench is a new test that checks if video AI models can combine several clues spread across time, not just guess from one frame or language priors.

#Video Question Answering #Video-LLM #Multi-Evidence Integration

MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Intermediate

Sihui Ji, Xi Chen et al. · Dec 16 · arXiv

MemFlow is a new way for AI to remember the right parts of a long video story while it keeps making new parts, so characters and scenes stay consistent.

#MemFlow #Narrative Adaptive Memory #Sparse Memory Activation

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Intermediate

Jun Zhang, Teng Wang et al. · Dec 16 · arXiv

TimeLens studies how to teach AI not just what happens in a video, but exactly when it happens, which is called video temporal grounding (VTG).

#video temporal grounding #multimodal large language models #benchmark re-annotation

Spherical Leech Quantization for Visual Tokenization and Generation

Intermediate

Yue Zhao, Hanwen Jiang et al. · Dec 16 · arXiv

This paper shows a simple, math-guided way to turn image pieces into tidy symbols (tokens) using points spread evenly on a sphere.

#Spherical Leech Quantization #Leech lattice #spherical codes

CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives

Intermediate

Zihan Wang, Jiashun Wang et al. · Dec 16 · arXiv

CRISP turns a normal phone video of a person into a clean 3D world and a virtual human that can move in it without breaking physics.

#real-to-sim #human-scene interaction #planar primitives

MMGR: Multi-Modal Generative Reasoning

Intermediate

Zefan Cai, Haoyi Qiu et al. · Dec 16 · arXiv

MMGR is a new benchmark that checks whether AI image and video generators follow real-world rules, not just whether their outputs look pretty.

#multi-modal generative reasoning #video generation evaluation #physical commonsense

Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

Intermediate

Lanxiang Hu, Siqi Kou et al. · Dec 16 · arXiv

Autoregressive (AR) models normally write one token at a time, which is accurate but slow for long answers.

#Jacobi Forcing #Jacobi decoding #consistency distillation

EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models

Intermediate

Zechen Bai, Chen Gao et al. · Dec 16 · arXiv

Robots usually learn by copying many demonstrations, which is expensive and makes them brittle when things change.

#EVOLVE-VLA #test-time training #vision-language-action
