This paper shows how to keep training a language model while it is solving one hard, real problem, so it can discover a single, truly great answer instead of many average ones.
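To make that idea concrete, here is a minimal, hypothetical sketch of such a test-time loop (the function names and the scoring rule are placeholders, not the paper's actual method): sample several attempts at the one problem, keep the best, briefly fine-tune on them, and try again.

```python
# Hypothetical sketch of single-problem test-time training, not the paper's method:
# keep updating the model on its own best attempts at one hard task.

def generate_candidates(model, problem, n=8):
    """Sample n candidate solutions from the current model (stub)."""
    return [f"candidate answer {i}" for i in range(n)]  # placeholder outputs

def score(candidate, problem):
    """Score a candidate, e.g. with unit tests or a verifier (stub)."""
    return len(candidate)  # placeholder reward

def finetune(model, examples):
    """One training step on the highest-scoring attempts (stub)."""
    return model  # in practice: a supervised or RL update

def solve_with_test_time_training(model, problem, rounds=5):
    best = None
    for _ in range(rounds):
        candidates = generate_candidates(model, problem)
        ranked = sorted(candidates, key=lambda c: score(c, problem), reverse=True)
        best = ranked[0]
        # Train on the top attempts so later rounds build on what worked.
        model = finetune(model, ranked[:2])
    return best  # one refined answer instead of many average ones

print(solve_with_test_time_training(model=None, problem="a hard, real problem"))
```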
FutureOmni is the first benchmark that tests if multimodal AI models can predict what happens next from both sound and video, not just explain what already happened.
CoDance is a new way to animate many characters in one picture using just one pose video, even if the picture and the video do not line up perfectly.
MoCha is a new AI that swaps a person in a video with a new character using only one mask on one frame and a few reference photos.
Large Vision-Language Models (LVLMs) look great on single images but often stumble when they must reason across multiple images.
MeepleLM is a special AI that reads a board game’s rulebook and pretends to be different kinds of players to give helpful, honest feedback.
RoboVIP is a plug-and-play tool that turns ordinary robot videos into many new, realistic, multi-view training videos without changing the original robot actions.
COMPASS is a new framework that turns a company’s rules into thousands of smart test questions to check if chatbots follow those rules.
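As a rough illustration of the rule-to-test idea (the helper names below are invented, not COMPASS's API), one can expand each written rule into probing questions and flag replies that break the rule:

```python
# Hedged sketch: turn each policy rule into probes, then audit the chatbot's replies.

POLICY_RULES = [
    "Never share customer account numbers.",
    "Always offer a human agent for billing disputes.",
]

def generate_probes(rule, n=3):
    """Stand-in for an LLM call that writes n test questions targeting one rule."""
    return [f"Probe {i + 1} for rule: {rule}" for i in range(n)]

def violates(rule, reply):
    """Stand-in for an automatic judge that flags rule violations."""
    return False

def audit(chatbot, rules=POLICY_RULES):
    failures = []
    for rule in rules:
        for probe in generate_probes(rule):
            reply = chatbot(probe)
            if violates(rule, reply):
                failures.append((rule, probe, reply))
    return failures

print(audit(lambda question: "I can't share that information."))
```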
This paper teaches video-language models to first find when the evidence appears in a video and then answer using that evidence, instead of mixing both steps together.
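A minimal way to picture that two-step recipe (the function names here are hypothetical, not the paper's interface): first localize the supporting segment, then answer using only that clip.

```python
# Illustrative two-stage pipeline: stage 1 finds when the evidence occurs,
# stage 2 answers from only that grounded clip.

def localize_evidence(video_frames, question):
    """Return (start, end) frame indices of the segment that supports the answer (stub)."""
    return 10, 25  # placeholder: in practice a grounding model predicts this

def answer_from_clip(clip_frames, question):
    """Answer the question using only the grounded clip (stub)."""
    return f"answer based on {len(clip_frames)} evidence frames"

def grounded_qa(video_frames, question):
    start, end = localize_evidence(video_frames, question)      # step 1: when
    answer = answer_from_clip(video_frames[start:end], question)  # step 2: what
    return answer, (start, end)

frames = list(range(100))  # stand-in for decoded video frames
print(grounded_qa(frames, "What causes the spill?"))
```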
The paper teaches a video model to squeeze long video history into a tiny memory while still keeping sharp details in single frames.
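One crude way to picture this split (a sketch under assumed sizes, not the paper's architecture) is a small fixed set of memory slots that absorb all past frames, kept alongside the full-detail current frame:

```python
# Minimal sketch: a tiny fixed-size memory for the whole history, plus the
# uncompressed latest frame. Slot count and feature size are made-up numbers.
import numpy as np

MEMORY_SLOTS = 8   # tiny budget for the entire past
FEATURE_DIM = 64   # per-frame feature size

def update_memory(memory, counts, frame_feat, t):
    """Fold frame t's features into one slot by running average
    (a crude stand-in for a learned compression module)."""
    slot = t % MEMORY_SLOTS
    counts[slot] += 1
    memory[slot] += (frame_feat - memory[slot]) / counts[slot]
    return memory, counts

def encode_video(frame_feats):
    """Return (compressed history, sharp current frame)."""
    memory = np.zeros((MEMORY_SLOTS, FEATURE_DIM))
    counts = np.zeros(MEMORY_SLOTS)
    for t, feat in enumerate(frame_feats[:-1]):
        memory, counts = update_memory(memory, counts, feat, t)
    return memory, frame_feats[-1]

history, current = encode_video(np.random.randn(500, FEATURE_DIM))
print(history.shape, current.shape)  # (8, 64) for 499 old frames vs (64,) for the latest one
```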
Transparent and shiny objects confuse normal depth cameras, but video diffusion models have already learned how light bends and reflects through them.
Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.