Papers (1055)

MMGR: Multi-Modal Generative Reasoning

MMGR is a new benchmark that checks whether AI image and video generators follow real-world rules, not just whether their outputs look pretty.

#multi-modal generative reasoning #video generation evaluation #physical commonsense

Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

Intermediate

Lanxiang Hu, Siqi Kou et al. | Dec 16 | arXiv

Autoregressive (AR) models normally write one token at a time, which is accurate but slow for long answers.

#Jacobi Forcing #Jacobi decoding #consistency distillation

EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models

Intermediate

Zechen Bai, Chen Gao et al. | Dec 16 | arXiv

Robots usually learn by copying many demonstrations, which is expensive and makes them brittle when things change.

#EVOLVE-VLA #test-time training #vision-language-action

VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse

Intermediate

Ying Nie, Kai Han et al. | Dec 16 | arXiv

Large language models get smarter when they get bigger, but storing all those extra weights eats tons of memory.

#VersatileFFN #parameter efficiency #virtual experts

RecGPT-V2 Technical Report

Intermediate

Chao Yi, Dian Chen et al. | Dec 16 | arXiv

RecGPT-V2 turns a recommender system into a smart team: a planner, several specialists, and a fair judge that all work together.

#Recommender systems #Large language models #Hierarchical multi-agent system

A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

Intermediate

Zixin Zhang, Kanghao Chen et al. | Dec 16 | arXiv

This paper builds A4-Agent, a smart three-part helper that figures out where to touch or use an object just from a picture and a written instruction, without any extra training.

#affordance prediction #zero-shot learning #vision-language models

RePo: Language Models with Context Re-Positioning

Intermediate

Huayang Li, Tianyu Zhao et al. | Dec 16 | arXiv

Large language models usually line words up in fixed order slots, which can waste mental energy and make it harder to find the important parts of a long or noisy text.

#context re-positioning #positional encoding #self-attention

Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure

Intermediate

Jooyeol Yun, Jaegul Choo | Dec 16 | arXiv

Vector Prism helps computers animate SVG images by first discovering which tiny shapes belong together as meaningful parts.

#SVG animation #semantic restructuring #vision–language models

SS4D: Native 4D Generative Model via Structured Spacetime Latents

Intermediate

Zhibing Li, Mengchen Zhang et al. | Dec 16 | arXiv

SS4D is a new AI model that turns a short single-camera video into a full 3D object that moves over time (that’s 4D), and it does this in about 2 minutes.

#4D generation #structured spacetime latents #temporal attention

Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in

Intermediate

Xiaoqian Shen, Min-Hung Chen et al. | Dec 16 | arXiv

Zoom-Zero helps AI answer questions about videos by first finding the right moment and then zooming in to double-check tiny details.

#Grounded Video Question Answering #Temporal Grounding #Coarse-to-Fine

Understanding and Improving Hyperbolic Deep Reinforcement Learning

Intermediate

Timo Klein, Thomas Lang et al. | Dec 16 | arXiv

Reinforcement learning agents often see the world in straight, flat space (Euclidean), but many decision problems look more like branching trees that fit curved, hyperbolic space better.

#hyperbolic reinforcement learning #Hyperboloid #Poincaré Ball

SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

Intermediate

Wentao Guo, Mayank Mishra et al. | Dec 16 | arXiv

SonicMoE makes Mixture-of-Experts (MoE) models train faster and use less memory by redesigning how data is moved and computed on GPUs.

#Mixture of Experts #Grouped GEMM #Token Rounding
