Papers (11)

#code generation

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners

The paper shows that when a model compares two of its own answers head-to-head, it picks the right one more often than when it judges each answer alone.

#pairwise self-verification #test-time scaling #parallel reasoning

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Intermediate

Yutong Wang, Siyuan Xiong et al. · Feb 26 · arXiv

Multi-agent systems are like teams of smart helpers, but one bad message can mislead the whole team.

#multi-agent systems #error propagation #test-time rectification

TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

Intermediate

Chansung Park, Juyong Jiang et al. · Feb 17 · arXiv

TAROT teaches code-writing AI the way good teachers teach kids: start at the right level and raise the bar at the right time.

#TAROT #curriculum learning #reinforcement fine-tuning

GameDevBench: Evaluating Agentic Capabilities Through Game Development

Intermediate

Wayne Chi, Yixiong Fang et al. · Feb 11 · arXiv

GameDevBench is a new test that checks if AI agents can actually make parts of video games, not just write code in one file.

#GameDevBench #Godot #multimodal agents

Weak-Driven Learning: How Weak Agents make Strong Agents Stronger

Intermediate

Zehao Chen, Gongxun Li et al. · Feb 9 · arXiv

Big language models can get stuck after fine-tuning because they become too sure of themselves, so normal training stops helping.

#weak-driven learning #logit mixing #weak agents

AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

Intermediate

Alisia Lupidi, Bhavul Gauri et al. · Feb 6 · arXiv

AIRS-Bench is a new test suite that checks whether AI research agents can do real machine learning research from start to finish, not just answer questions.

#AIRS-Bench #AI research agents #LLM agents

Proxy Compression for Language Modeling

Intermediate

Lin Zheng, Xinyu Li et al. · Feb 4 · arXiv

Most language models are trained on compressed tokens, which makes training fast but ties the model to a specific tokenizer.

#proxy compression #byte-level language modeling #tokenizer-free inference

LatentMem: Customizing Latent Memory for Multi-Agent Systems

Intermediate

Muxin Fu, Guibin Zhang et al. · Feb 3 · arXiv

LatentMem is a new memory system that helps teams of AI agents remember the right things for their specific jobs without overloading them with text.

#multi-agent systems #latent memory #role-aware memory

FourierSampler: Unlocking Non-Autoregressive Potential in Diffusion Language Models via Frequency-Guided Generation

Intermediate

Siyang He, Qiqi Wang et al. · Jan 30 · arXiv

Diffusion language models (dLLMs) can write text in any order, but common decoding methods still favor left-to-right generation, which wastes their superpower.

#diffusion language models #non-autoregressive generation #frequency-domain analysis

BatCoder: Self-Supervised Bidirectional Code-Documentation Learning via Back-Translation

Intermediate

Jingwen Xu, Yiyang Lu et al. · Jan 30 · arXiv

BatCoder teaches a code model to write both code and its documentation by doing a round trip: from code to docs and back to code.

#back-translation #self-supervised learning #reinforcement learning for code

Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model

Intermediate

Chenghao Fan, Wen Heng et al. · Jan 22 · arXiv

Stable-DiffCoder is a code-focused diffusion language model that learns to write and edit programs by filling in masked pieces, not just predicting the next token.

#diffusion language model #block diffusion #code generation
