Papers (943)

VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory

VideoSSM is a new way to make long, stable, and lively videos by giving the model two kinds of memory: a short-term window and a long-term state-space memory.

#autoregressive video diffusion #state-space model #hybrid memory

TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows

Intermediate

Zhenglin Cheng, Peng Sun et al. · Dec 3 · arXiv

TwinFlow is a new way to make big image models draw great pictures in just one step instead of 40–100 steps.

#TwinFlow #one-step generation #twin trajectories

M3DR: Towards Universal Multilingual Multimodal Document Retrieval

Intermediate

Adithya S Kolavi, Vyoman Jain · Dec 3 · arXiv

The paper introduces M3DR, a way for computers to find the right document image no matter which of 22 languages the query or the document uses.

#multilingual retrieval #multimodal retrieval #document image search

SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning

Intermediate

Salman Rahman, Sruthi Gorantla et al. · Dec 2 · arXiv

SPARK teaches AI to grade its own steps without needing the right answers written down anywhere.

#SPARK #Process Reward Model #PRM-CoT

Self-Improving VLM Judges Without Human Annotations

Intermediate

Inna Wanyin Lin, Yushi Hu et al. · Dec 2 · arXiv

The paper shows how a vision-language model (VLM) can train itself to be a fair judge of answers about images without using any human preference labels.

#vision-language model #VLM judge #reward model

Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in $\{\pm 1, \pm i\}$

Intermediate

Feiyu Wang, Xinyu Tan et al. · Dec 2 · arXiv

Fairy2i turns any pre-trained real-valued Transformer layer into an exactly equivalent complex form, so nothing changes before quantization.

#LLM quantization #complex-valued neural networks #widely-linear transformation

Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules

Beginner

Amr Mohamed, Yang Zhang et al. · Dec 2 · arXiv

Diffusion language models (dLLMs) can write all parts of an answer in parallel, but they usually take many tiny cleanup steps, which makes them slow.

#diffusion language models #early exit decoding #progress-aware threshold

ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

Intermediate

Yifan Li, Yingda Yin et al. · Dec 2 · arXiv

ReVSeg teaches an AI to segment objects in videos by thinking step-by-step instead of guessing everything at once.

#Reasoning Video Object Segmentation #Vision-Language Models #Temporal Grounding

PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

Intermediate

Bowen Ping, Chengyou Jia et al. · Dec 2 · arXiv

This paper teaches image models to keep things consistent across multiple pictures—like the same character, art style, and story logic—using reinforcement learning (RL).

#consistent image generation #pairwise reward modeling #reinforcement learning

From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

Intermediate

Changpeng Yang, Jinyang Wu et al. · Dec 2 · arXiv

This paper teaches AI models to reason better by first copying only good examples and later learning from mistakes too.

#Curriculum Advantage Policy Optimization #advantage-based RL #imitation learning

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Intermediate

Le Thien Phuc Nguyen, Zhuoran Yu et al. · Dec 1 · arXiv

This paper introduces AV-SpeakerBench, a new test that checks if AI can truly see, hear, and understand who is speaking, what they say, and when they say it in real videos.

#audiovisual reasoning #speaker attribution #temporal grounding

Reinventing Clinical Dialogue: Agentic Paradigms for LLM Enabled Healthcare Communication

Intermediate

Xiaoquan Zhi, Hongke Zhao et al. · Dec 1 · arXiv

Clinical conversations are special because they mix caring feelings with precise medical facts, and old AI systems struggled to do both at once.

#clinical dialogue #agentic AI #large language models
