How I Study AI - Learn AI Papers & Lectures the Easy Way

Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

Intermediate

Zhenghao Xu, Qin Lu et al.Feb 5arXiv

The paper studies a simple way to train giant language models with reinforcement learning by replacing a hard-to-compute term (the log-partition function) with something easy: the mean reward.

#Policy Mirror Descent#KL regularization#chi-squared regularization

FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

Intermediate

Zhaopeng Qiu, Shuang Yu et al.Jan 26arXiv

The paper shows how to speed up reinforcement learning (RL) for large language models (LLMs) by making numbers smaller (FP8) without breaking training.

#FP8 quantization#LLM reinforcement learning#KV-cache

Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems

Intermediate

Eddie Landesberg, Manjari NarayanDec 11arXiv

LLM judges are cheap but biased; without calibration they can completely flip which model looks best.

#LLM-as-judge#calibration#isotonic regression

Papers3

Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems