Papers (1262)

A Mechanistic View on Video Generation as World Models: State and Dynamics

Luozhou Wang, Zhifei Chen et al., Jan 22, arXiv

This paper says modern video generators are starting to act like tiny "world simulators," not just pretty video painters.

#world models #video generation #state representation

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Intermediate

Shengbang Tong, Boyang Zheng et al., Jan 22, arXiv

Before this work, most text-to-image models used VAEs (small, squished image codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.

#Representation Autoencoder #RAE #Variational Autoencoder

IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance

Beginner

Jongwoo Park, Kanchana Ranasinghe et al., Jan 22, arXiv

IVRA is a simple, training-free add-on that helps robot brains keep the 2D shape of pictures while following language instructions.

#Vision-Language-Action #affinity map #training-free guidance

LLM-in-Sandbox Elicits General Agentic Intelligence

Beginner

Daixuan Cheng, Shaohan Huang et al., Jan 22, arXiv

This paper shows that giving an AI a safe, tiny virtual computer (a sandbox) lets it solve many kinds of problems better, not just coding ones.

#LLM-in-Sandbox #Agentic Intelligence #Reinforcement Learning

360Anything: Geometry-Free Lifting of Images and Videos to 360°

Intermediate

Ziyi Wu, Daniel Watson et al., Jan 22, arXiv

This paper shows how to turn any normal photo or video into a seamless 360° panorama without needing the camera’s settings like field of view or tilt.

#360 panorama generation #equirectangular projection #diffusion transformer

Learning to Discover at Test Time

Intermediate

Mert Yuksekgonul, Daniel Koceja et al., Jan 22, arXiv

This paper shows how to keep training a language model while it is solving one hard, real problem, so it can discover a single, truly great answer instead of many average ones.

#test-time training #reinforcement learning #entropic objective

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Intermediate

Moo Jin Kim, Yihuai Gao et al., Jan 22, arXiv

Cosmos Policy teaches robots to act by fine-tuning a powerful video model in just one training stage, without changing the model’s architecture.

#video diffusion #robot policy learning #visuomotor control

ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

Intermediate

Remy Sabathier, David Novotny et al., Jan 22, arXiv

ActionMesh is a fast, feed-forward AI that turns videos, images + text, text alone, or a given 3D model into an animated 3D mesh.

#ActionMesh #temporal 3D diffusion #animated 3D mesh

Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

Intermediate

Tingyu Song, Yanzhao Zhang et al., Jan 22, arXiv

This paper introduces EDIR, a new and much more detailed test for Composed Image Retrieval (CIR), where you search for a target image using a starting image plus a short text change.

#Composed Image Retrieval #EDIR #fine-grained benchmark

SAMTok: Representing Any Mask with Two Words

Intermediate

Yikang Zhou, Tao Zhang et al., Jan 22, arXiv

SAMTok turns any object’s mask in an image into just two special “words” so language models can handle pixels like they handle text.

#SAMTok #mask tokenizer #residual vector quantization

Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain

Intermediate

Özgür Uğur, Mahmut Göksu et al., Jan 22, arXiv

The paper builds special Turkish legal AI models called Mecellem by teaching them from the ground up and then giving them more law-focused lessons.

#Turkish legal NLP #ModernBERT #Continual pre-training

HyperAlign: Hypernetwork for Efficient Test-Time Alignment of Diffusion Models

Intermediate

Xin Xie, Jiaxian Guo et al., Jan 22, arXiv

Diffusion models make pictures from noise but often miss what people actually want in the prompt or what looks good to humans.

#diffusion models #rectified flow #hypernetwork
