Papers (200)

Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

This paper shows that many reasoning failures in AI are caused by just a few distracting words in the prompt, not by the problems being too hard.

#LENS, #Interference Tokens, #Reinforcement Learning with Verifiable Rewards

Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report

Beginner

Zhuoran Yang, Ed Li et al. · Jan 28 · arXiv

This paper introduces Foundation-Sec-8B-Reasoning, a small (8-billion-parameter) AI model that is trained to “think out loud” before answering cybersecurity questions.

#native reasoning, #cybersecurity LLM, #chain-of-thought

DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

Beginner

Nikita Gupta, Riju Chatterjee et al. · Jan 28 · arXiv

DeepSearchQA is a new test with 900 real-world-style questions that checks whether AI agents can find complete lists of answers, not just one fact.

#DeepSearchQA, #agentic information retrieval, #systematic collation

AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

Beginner

Kaiyuan Chen, Qimin Wu et al. · Jan 28 · arXiv

This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.

#AgentIF-OneDay, #instruction following, #AI agents

Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Beginner

Zengbin Wang, Xuecai Hu et al. · Jan 28 · arXiv

Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.

#text-to-image, #spatial intelligence, #occlusion

TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models

Beginner

Fangxu Yu, Xingang Guo et al. · Jan 26 · arXiv

TSRBench is a giant test that checks if AI models can understand and reason about data that changes over time, like heartbeats, stock prices, and weather.

#time series reasoning, #multimodal benchmark, #perception

HalluCitation Matters: Revealing the Impact of Hallucinated References with 300 Hallucinated Papers in ACL Conferences

Beginner

Yusuke Sakai, Hidetaka Kamigaito et al. · Jan 26 · arXiv

The paper finds almost 300 accepted NLP papers (mostly in 2025) that include at least one fake or non-existent reference, which the authors call a HalluCitation.

#HalluCitation, #hallucinated citations, #citation verification

Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents

Beginner

Zhihan Liu, Lin Guan et al. · Jan 26 · arXiv

LLM agents are usually trained in a few worlds but asked to work in many different, unseen worlds, which often hurts their performance.

#cross-domain generalization, #state information richness, #planning complexity

VIBEVOICE-ASR Technical Report

Beginner

Zhiliang Peng, Jianwei Yu et al. · Jan 26 · arXiv

VIBEVOICE-ASR is a single-pass system that listens to up to 60 minutes of audio at once and outputs who spoke, when they spoke, and what they said in one stream.

#long-form ASR, #speaker diarization, #timestamping

Typhoon-S: Minimal Open Post-Training for Sovereign Large Language Models

Beginner

Kunat Pipatanakul, Pittawat Taveekitworachai · Jan 26 · arXiv

Typhoon-S is a simple, open recipe that turns a basic language model into a helpful assistant and then teaches it important local skills, all on small budgets.

#Typhoon-S, #on-policy distillation, #full-logits distillation

Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

Beginner

Zecheng Tang, Quantong Qiu et al. · Jan 24 · arXiv

Transformers slow down on very long inputs because standard attention looks at every token pair, which is expensive.

#elastic attention, #sparse attention, #full attention

LongCat-Flash-Thinking-2601 Technical Report

Beginner

Meituan LongCat Team, Anchun Gui et al. · Jan 23 · arXiv

LongCat-Flash-Thinking-2601 is a huge 560-billion-parameter Mixture-of-Experts model built to act like a careful helper that can use tools, browse, code, and solve multi-step tasks.

#Agentic reasoning, #Mixture-of-Experts, #Asynchronous reinforcement learning
