Concepts (5)

Groups

RLHF Mathematics

RLHF (reinforcement learning from human feedback) converts human preferences between two model outputs into a training signal by fitting a probabilistic model of choice, such as the Bradley-Terry model.

#rlhf #bradley-terry #pairwise comparisons
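
As a rough illustration of the probabilistic choice model, the sketch below assumes the Bradley-Terry model named in the tags: each output gets a scalar reward, and the probability that one output is preferred is the logistic function of the reward difference. The function names and the scalar-reward setup are assumptions made for this example, not something taken from the page.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that output A is preferred over output B.

    Assumes each output has a scalar reward; P(A > B) = sigmoid(r_A - r_B).
    """
    diff = reward_a - reward_b
    # Branch on the sign so math.exp never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed human choice.

    Equals -log(sigmoid(r_chosen - r_rejected)), written in a numerically
    stable softplus form.
    """
    diff = reward_chosen - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

Minimizing preference_loss over many labeled pairs pushes the reward of the chosen output above the reward of the rejected one, which is the training signal the description refers to.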

Exploration-Exploitation Tradeoff

The exploration–exploitation tradeoff is the tension between trying new actions to learn (exploration) and using the best-known action to earn rewards now (exploitation).

#multi-armed bandit
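
One common way to make the tradeoff concrete is an epsilon-greedy strategy on a multi-armed bandit. The sketch below is a minimal illustration under stated assumptions: Gaussian rewards with unit variance, a fixed epsilon, and hypothetical names (epsilon_greedy, true_means) chosen for this example.

```python
import random


def epsilon_greedy(true_means, steps=1000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a bandit whose arms pay Gaussian rewards.

    true_means is a hypothetical list of per-arm mean rewards, hidden from
    the agent. Returns (pull counts, estimated means, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: try a uniformly random arm to learn about it.
            arm = rng.randrange(n_arms)
        else:
            # Exploitation: pull the arm with the best current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return counts, estimates, total_reward
```

Setting epsilon to 0 gives pure exploitation and setting it to 1 gives pure exploration; values in between trade immediate reward for better estimates of the other arms.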

Concepts (5)

RLHF Mathematics

Exploration-Exploitation Tradeoff

Value Function Approximation

Policy Gradient Theorem

Bellman Equations