Papers (1262)

Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision

Wei Du, Shubham Toshniwal et al. · Dec 17 · arXiv

Nemotron-Math is a giant math dataset with 7.5 million step-by-step solutions created in three thinking styles and with or without Python help.

#mathematical reasoning #long-context fine-tuning #multi-mode supervision


Step-GUI Technical Report

Intermediate

Haolong Yan, Jia Wang et al. · Dec 17 · arXiv

This paper builds Step-GUI, a pair of small-but-strong GUI agent models (4B/8B) that can use phones and computers by looking at screenshots and following instructions.

#GUI automation #multimodal large language models #trajectory-level calibration


SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

Beginner

Zehua Pei, Hui-Ling Zhen et al. · Dec 17 · arXiv

SCOPE lets AI agents rewrite their own instructions while they are working, so they can fix mistakes and get smarter on the next step, not just the next task.

#prompt evolution #LLM agents #context management


Robust and Calibrated Detection of Authentic Multimedia Content

Intermediate

Sarim Hashmi, Abdelrahman Elsayed et al. · Dec 17 · arXiv

Deepfakes are getting so good that simple yes/no detectors are failing, especially when attackers add tiny, invisible changes.

#Authenticity Index #calibrated resynthesis #reconstruction-free inversion


DEER: Draft with Diffusion, Verify with Autoregressive Models

Intermediate

Zicong Cheng, Guo-Wei Yang et al. · Dec 17 · arXiv

DEER is a new way to speed up big language models by letting a diffusion model draft many tokens at once and an autoregressive model double-check them.

#DEER #speculative decoding #diffusion LLM

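The draft-then-verify idea in the DEER summary can be illustrated with a toy sketch. This is a generic greedy speculative-decoding check, not the paper's actual algorithm: the function name `verify_draft` and the `next_token` callback are hypothetical, and the verifier is called one position at a time for clarity, whereas real systems typically score all drafted positions in a single forward pass.

```python
from typing import Callable, List


def verify_draft(
    prefix: List[int],
    draft: List[int],
    next_token: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens kept after checking a drafted block.

    `next_token` stands in for the autoregressive verifier (an assumption
    for illustration): given a context, it returns its greedy next token.
    Drafted tokens are accepted while they match the verifier; at the first
    mismatch the verifier's own token is kept and the rest is discarded.
    """
    context = list(prefix)
    accepted: List[int] = []
    for token in draft:
        expected = next_token(context)
        if token != expected:
            # Mismatch: keep the verifier's correction and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    return accepted


# Toy verifier: the "correct" next token is simply the context length.
toy_verifier = lambda ctx: len(ctx)
kept = verify_draft([0, 1, 2], [3, 4, 9, 6], toy_verifier)
print(kept)  # [3, 4, 5]
```

In this toy run the first two drafted tokens are accepted, the third is replaced by the verifier's choice, and the fourth is dropped, so one verification round still yields three tokens.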

Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets

Intermediate

Jialong Zuo, Haoyou Deng et al. · Dec 17 · arXiv

This paper checks if a popular text-to-image model called Nano Banana Pro can fix messy photos without any extra training.

#low-level vision #zero-shot restoration #generative models


Prompt Repetition Improves Non-Reasoning LLMs

Beginner

Yaniv Leviathan, Matan Kalman et al. · Dec 17 · arXiv

Repeating the entire prompt once (QUERY→QUERY+QUERY) helps many large language models answer better when you are not asking them to show their reasoning.

#prompt repetition #non-reasoning LLMs #causal attention

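The QUERY→QUERY+QUERY transformation from the summary above is simple enough to sketch. The helper name `repeat_prompt` and the blank-line separator are assumptions for illustration; the listing does not say how the paper joins the two copies.

```python
def repeat_prompt(query: str, times: int = 2, separator: str = "\n\n") -> str:
    """Build a prompt that contains `query` repeated `times` times.

    The separator is a guess for illustration; the exact formatting used
    in the paper is not stated in this listing.
    """
    if times < 1:
        raise ValueError("times must be at least 1")
    return separator.join([query] * times)


prompt = repeat_prompt("Which is larger, 9.11 or 9.9?")
print(prompt)
```

The resulting string would be sent to the model in place of the original query, with no other change to the request.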

Puzzle Curriculum GRPO for Vision-Centric Reasoning

Intermediate

Ahmadreza Jeddi, Hakki Can Karaimer et al. · Dec 16 · arXiv

This paper teaches vision-language models to reason about pictures using puzzles instead of expensive human labels.

#vision-language models #reinforcement learning #group-relative policy optimization


HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering

Beginner

Dan Ben-Ami, Gabriele Serussi et al. · Dec 16 · arXiv

HERBench is a new test that checks if video AI models can combine several clues spread across time, not just guess from one frame or language priors.

#Video Question Answering #Video-LLM #Multi-Evidence Integration


MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Intermediate

Sihui Ji, Xi Chen et al. · Dec 16 · arXiv

MemFlow is a new way for AI to remember the right parts of a long video story while it keeps making new parts, so characters and scenes stay consistent.

#MemFlow #Narrative Adaptive Memory #Sparse Memory Activation


TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Intermediate

Jun Zhang, Teng Wang et al. · Dec 16 · arXiv

TimeLens studies how to teach AI not just what happens in a video, but exactly when it happens, which is called video temporal grounding (VTG).

#video temporal grounding #multimodal large language models #benchmark re-annotation


Spherical Leech Quantization for Visual Tokenization and Generation

Intermediate

Yue Zhao, Hanwen Jiang et al. · Dec 16 · arXiv

This paper shows a simple, math-guided way to turn image pieces into tidy symbols (tokens) using points spread evenly on a sphere.

#Spherical Leech Quantization #Leech lattice #spherical codes

