Small AI models often stumble when a tool call fails, then get stuck repeating the same bad call instead of fixing the mistake.
Qwen3-TTS is a family of text-to-speech models that can talk in 10+ languages, clone a new voice from just 3 seconds of audio, and follow detailed style instructions in real time.
Robots need videos that not only look pretty but also obey real-world physics and actually complete the task they were asked to show.
OpenVision 3 is a single vision encoder that learns one set of image tokens useful for both understanding images (like answering questions) and generating images (like making new pictures).
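To show the shared-token idea in miniature, here is a hypothetical toy, not OpenVision 3's actual architecture; the encoder, heads, and sizes below are invented for the sketch:

```python
import torch
import torch.nn as nn

class SharedTokenEncoder(nn.Module):
    """Toy encoder: one image becomes one set of tokens reused by both tasks."""
    def __init__(self, dim: int = 64):
        super().__init__()
        # 8x8 patches of a 32x32 image -> a 4x4 grid of patch embeddings
        self.patchify = nn.Conv2d(3, dim, kernel_size=8, stride=8)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (B, dim, 4, 4) -> (B, 16 tokens, dim)
        return self.patchify(images).flatten(2).transpose(1, 2)

encoder = SharedTokenEncoder()
understand_head = nn.Linear(64, 1000)      # e.g., answer/classification logits
generate_head = nn.Linear(64, 8 * 8 * 3)   # e.g., an 8x8 RGB patch per token

images = torch.randn(2, 3, 32, 32)
tokens = encoder(images)                    # one set of image tokens...
answers = understand_head(tokens.mean(1))   # ...pooled for understanding
patches = generate_head(tokens)             # ...and reused for generation
```

The point of the sketch is only that both heads consume the same tokens; in a unified encoder, training signals from both tasks shape that one representation.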
This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'
Benign fine-tuning meant to make language models more helpful can accidentally make them overshare private information.
Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the words you tell them.
Diffusion language models can write tokens in any order, but that freedom can accidentally hurt their ability to reason well.
Render-of-Thought (RoT) turns the model’s step-by-step thinking from long text into compact images so the model can think faster with fewer tokens.
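As a rough sketch of the rendering step (the font, layout, and image size here are assumptions, not the paper's actual pipeline):

```python
from PIL import Image, ImageDraw

def render_cot(text: str, width: int = 448, line_height: int = 14,
               chars_per_line: int = 70) -> Image.Image:
    """Render a reasoning trace into a compact grayscale image."""
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    img = Image.new("L", (width, line_height * max(len(lines), 1)), color=255)
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((4, row * line_height), line, fill=0)  # default bitmap font
    return img

cot = "Step 1: restate the problem. Step 2: try small cases. Step 3: generalize."
render_cot(cot).save("cot.png")
```

The intuition: a vision encoder compresses an image into a fixed, small number of visual tokens, so a long reasoning trace rendered this way can cost far fewer tokens than the same trace kept as raw text.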
HERMES is a training-free way to make video-language models understand live, streaming video quickly and accurately.
Typhoon OCR is an open, lightweight vision-language model that reads Thai and English documents and returns clean, structured text.
Robots used to explore by following simple rules or short-term rewards, which often made them waste time and backtrack a lot.