Papers (6)

#importance sampling

Heterogeneous Agent Collaborative Reinforcement Learning

Zhixia Zhang, Zixuan Huang et al. · Mar 3 · arXiv

This paper introduces HACRL, a framework in which different kinds of AI agents learn together during training but still run independently at inference time.

#HACRL #HACPO #heterogeneous agents

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Intermediate

Zeyuan Liu, Jeonghye Kim et al. · Feb 26 · arXiv

This paper teaches a language-model agent to explore more effectively by combining on-policy and off-policy learning with a simple, self-written memory.

#EMPO #memory-augmented agents #on-policy learning

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Intermediate

Guobin Shen, Chenxiao Zhao et al. · Feb 11 · arXiv

VESPO is a new, stable way to train language models with reinforcement learning even when the training data comes from older or mismatched policies.

#VESPO #off-policy reinforcement learning #importance sampling

Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

Intermediate

Zhenghao Xu, Qin Lu et al. · Feb 5 · arXiv

The paper studies a simple way to train large language models with reinforcement learning: replace a hard-to-compute term (the log-partition function) with something easy to compute, the mean reward.

#Policy Mirror Descent #KL regularization #chi-squared regularization

FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

Intermediate

Zhaopeng Qiu, Shuang Yu et al. · Jan 26 · arXiv

The paper shows how to speed up reinforcement learning (RL) for large language models (LLMs) by using lower-precision 8-bit floating-point numbers (FP8) without breaking training.

#FP8 quantization #LLM reinforcement learning #KV-cache

Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems

Intermediate

Eddie Landesberg, Manjari Narayan · Dec 11 · arXiv

LLM judges are cheap but biased; without calibration they can completely flip which model looks best.

#LLM-as-judge #calibration #isotonic regression