How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (943)

AnyDepth: Depth Estimation Made Easy

Intermediate
Zeyu Ren, Zeyu Zhang et al. · Jan 6 · arXiv

AnyDepth is a new, simple way for a computer to tell how far things are in a picture using just one image (monocular depth estimation).

#monocular depth estimation · #zero-shot depth · #Simple Depth Transformer

SimpleMem: Efficient Lifelong Memory for LLM Agents

Intermediate
Jiaqi Liu, Yaofeng Su et al. · Jan 5 · arXiv

SimpleMem is a new memory system that helps AI remember long conversations without wasting space or tokens.

#LLM memory · #semantic compression · #online synthesis

VINO: A Unified Visual Generator with Interleaved OmniModal Context

Beginner
Junyi Chen, Tong He et al. · Jan 5 · arXiv

VINO is a single AI model that can make and edit both images and videos by listening to text and looking at reference pictures and clips at the same time.

#VINO · #unified visual generator · #multimodal diffusion transformer

Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Intermediate
Jing Tan, Zhaoyang Zhang et al. · Jan 5 · arXiv

Talk2Move is a training recipe that lets an image editor move, rotate, and resize the exact object you mention using plain text, while keeping the rest of the picture stable.

#text-guided image editing · #object-level transformation · #reinforcement learning

Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling

Beginner
Falcon LLM Team, Iheb Chaabane et al. · Jan 5 · arXiv

Falcon-H1R is a small (7B) AI model that thinks really well without needing giant computers.

#Falcon-H1R · #Hybrid Transformer-Mamba · #Chain-of-Thought

InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

Intermediate
Shuai Yuan, Yantai Yang et al. · Jan 5 · arXiv

InfiniteVGGT is a streaming 3D vision system that can keep working forever on live video without running out of memory.

#InfiniteVGGT · #rolling memory · #causal attention
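The "rolling memory" tag refers to a general streaming idea: keep a bounded buffer of recent context so memory use stays constant no matter how long the stream runs. A minimal sketch of that generic idea (an illustration only, not InfiniteVGGT's actual architecture; the class and names here are hypothetical):

```python
from collections import deque

# Bounded rolling memory: keep only the most recent `capacity` frame
# features so memory use stays constant on an endless video stream.
# Generic concept sketch only -- not InfiniteVGGT's design.

class RollingMemory:
    def __init__(self, capacity):
        # deque with maxlen evicts the oldest entry automatically
        self.frames = deque(maxlen=capacity)

    def add(self, frame_feature):
        self.frames.append(frame_feature)

    def context(self):
        return list(self.frames)

mem = RollingMemory(capacity=3)
for t in range(5):
    mem.add(f"frame-{t}")
print(mem.context())  # → ['frame-2', 'frame-3', 'frame-4']
```

However many frames arrive, the buffer never holds more than `capacity` entries, which is what lets a streaming system run "forever" at fixed memory cost.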

DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies

Intermediate
Renke Wang, Zhenyu Zhang et al. · Jan 5 · arXiv

DiffProxy turns tricky multi-camera photos of a person into a clean 3D body and hands by first painting a precise 'map' on each pixel and then fitting a standard body model to that map.

#human mesh recovery · #SMPL-X · #dense correspondence

VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

Intermediate
Shikun Sun, Liao Qu et al. · Jan 5 · arXiv

Visual Autoregressive (VAR) models draw whole grids of image tokens at once across multiple scales, which makes standard reinforcement learning (RL) unstable; this paper tackles the asynchronous policy conflicts behind that instability.

#Visual Autoregressive (VAR) · #Reinforcement Learning · #GRPO

VIBE: Visual Instruction Based Editor

Intermediate
Grigorii Alekseenko, Aleksandr Gordeev et al. · Jan 5 · arXiv

VIBE is a tiny but mighty image editor that listens to your words and changes pictures while keeping the original photo intact unless you ask otherwise.

#instruction-based image editing · #vision-language model · #diffusion model

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

Intermediate
Huichao Zhang, Liao Qu et al. · Jan 5 · arXiv

NextFlow is a single, decoder-only Transformer that can read and write both text and images in one continuous sequence.

#Next-Scale Prediction · #Autoregressive Transformer · #Dual-Codebook Tokenization

Confidence Estimation for LLMs in Multi-turn Interactions

Intermediate
Caiqi Zhang, Ruihan Yang et al. · Jan 5 · arXiv

This paper studies how sure (confident) large language models are during multi-turn chats where clues arrive step by step.

#multi-turn confidence estimation · #LLM calibration · #InfoECE
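The "LLM calibration" tag refers to how well a model's stated confidence matches how often it is actually right. The standard score for this is expected calibration error (ECE); InfoECE is this paper's own metric, but a generic ECE sketch (standard ECE, not the paper's method) looks like:

```python
# Generic expected calibration error (ECE): bin answers by confidence,
# then compare each bin's accuracy to its mean confidence.
# This is textbook ECE, not the paper's InfoECE metric.

def expected_calibration_error(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(o for _, o in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Perfectly calibrated toy case: 0.5-confidence answers, right half the time.
print(expected_calibration_error([0.5, 0.5], [1, 0]))  # → 0.0
```

A model that says "90% sure" but is right only 60% of the time gets a large ECE; a well-calibrated model's confidence tracks its accuracy and ECE approaches zero.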

Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting

Intermediate
Muxi Diao, Lele Yang et al. · Jan 5 · arXiv

Supervised fine-tuning (SFT) often makes a model great at a new task but worse at its old skills; this paper explains a key reason why and how to fix it.

#Entropy-Adaptive Fine-Tuning · #confident conflicts · #token-level entropy
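The "token-level entropy" tag is a standard quantity: the Shannon entropy of the model's next-token distribution, where low entropy means the model is confident about that token. A minimal sketch of the generic quantity (not the paper's fine-tuning procedure):

```python
import math

# Shannon entropy of one next-token probability distribution, in nats.
# Low entropy = confident prediction; high entropy = uncertain.
# This computes the generic quantity only, not the paper's EAFT recipe.

def token_entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25] * 4                # maximally uncertain over 4 tokens
peaked = [0.97, 0.01, 0.01, 0.01]   # confident prediction

print(token_entropy(uniform))  # → log(4) ≈ 1.386
print(token_entropy(peaked))   # ≈ 0.17, near zero
```

Entropy-aware methods use this per-token signal to treat tokens where the model is already confident differently from tokens where it is uncertain.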