Papers (924)

DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

DSGym is a unified 'gym' where AI data science agents are tested and trained by actually running code on real datasets, not just chatting about them.

#DSGym #data science agents #execution-grounded evaluation

Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory

Intermediate

Dohun Lee, Chun-Hao Paul Huang et al. · Jan 22 · arXiv

Memory-V2V teaches video editing AIs to remember what they already changed so new edits stay consistent with old ones.

#multi-turn video editing #video-to-video diffusion #explicit memory

GameTalk: Training LLMs for Strategic Conversation

Intermediate

Victor Conchello Vendrell, Max Ruiz Luyten et al. · Jan 22 · arXiv

Large language models usually get judged one message at a time, but many real tasks need smart planning across a whole conversation.

#strategic conversation #reinforcement learning for LLMs #multi-turn dialogue

A Mechanistic View on Video Generation as World Models: State and Dynamics

Intermediate

Luozhou Wang, Zhifei Chen et al. · Jan 22 · arXiv

This paper says modern video generators are starting to act like tiny "world simulators," not just pretty video painters.

#world models #video generation #state representation

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Intermediate

Shengbang Tong, Boyang Zheng et al. · Jan 22 · arXiv

Before this work, most text-to-image models used VAEs (small, squished image codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.

#Representation Autoencoder #RAE #Variational Autoencoder

IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance

Beginner

Jongwoo Park, Kanchana Ranasinghe et al. · Jan 22 · arXiv

IVRA is a simple, training-free add-on that helps robot brains keep the 2D shape of pictures while following language instructions.

#Vision-Language-Action #affinity map #training-free guidance

LLM-in-Sandbox Elicits General Agentic Intelligence

Beginner

Daixuan Cheng, Shaohan Huang et al. · Jan 22 · arXiv

This paper shows that giving an AI a safe, tiny virtual computer (a sandbox) lets it solve many kinds of problems better, not just coding ones.

#LLM-in-Sandbox #Agentic Intelligence #Reinforcement Learning

360Anything: Geometry-Free Lifting of Images and Videos to 360°

Intermediate

Ziyi Wu, Daniel Watson et al. · Jan 22 · arXiv

This paper shows how to turn any normal photo or video into a seamless 360° panorama without needing the camera’s settings like field of view or tilt.

#360 panorama generation #equirectangular projection #diffusion transformer

Learning to Discover at Test Time

Intermediate

Mert Yuksekgonul, Daniel Koceja et al. · Jan 22 · arXiv

This paper shows how to keep training a language model while it is solving one hard, real problem, so it can discover a single, truly great answer instead of many average ones.

#test-time training #reinforcement learning #entropic objective

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Intermediate

Moo Jin Kim, Yihuai Gao et al. · Jan 22 · arXiv

Cosmos Policy teaches robots to act by fine-tuning a powerful video model in just one training stage, without changing the model’s architecture.

#video diffusion #robot policy learning #visuomotor control

ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

Intermediate

Remy Sabathier, David Novotny et al. · Jan 22 · arXiv

ActionMesh is a fast, feed-forward AI that turns videos, images + text, text alone, or a given 3D model into an animated 3D mesh.

#ActionMesh #temporal 3D diffusion #animated 3D mesh

Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

Intermediate

Tingyu Song, Yanzhao Zhang et al. · Jan 22 · arXiv

This paper introduces EDIR, a new and much more detailed test for Composed Image Retrieval (CIR), where you search for a target image using a starting image plus a short text change.

#Composed Image Retrieval #EDIR #fine-grained benchmark
