The discounted return G_t sums all future rewards but down-weights distant rewards by powers of a discount factor γ.
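A minimal sketch of this sum, assuming a finite list of rewards and an illustrative γ; iterating backwards folds each reward into the discounted tail:

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = r_t + gamma * r_{t+1} + gamma**2 * r_{t+2} + ..."""
    g = 0.0
    # Work backwards: each step adds the reward plus the discounted tail.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```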
RLHF turns human preferences between two model outputs into training signals using a probabilistic model of choice.
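The probabilistic model of choice commonly used here is Bradley-Terry: the probability that the human prefers one output is the sigmoid of the difference in reward scores. A sketch with illustrative scalar rewards:

```python
import math

def preference_prob(r_chosen, r_rejected):
    """Bradley-Terry: P(chosen preferred) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def preference_loss(r_chosen, r_rejected):
    """Negative log-likelihood used as the reward model's training signal."""
    return -math.log(preference_prob(r_chosen, r_rejected))

# Equal reward scores: the model is indifferent between the two outputs.
print(preference_prob(1.0, 1.0))  # 0.5
```

Training the reward model lowers this loss, i.e. pushes the chosen output's score above the rejected one's.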
The exploration-exploitation tradeoff is the tension between trying new actions to learn (exploration) and using the best-known action to earn rewards now (exploitation).
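The simplest way to balance the two is epsilon-greedy: explore with probability epsilon, otherwise exploit. A sketch over an illustrative list of action values:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With prob. epsilon pick a random action (explore); else argmax (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# epsilon=0 is pure exploitation: always the best-known action.
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # 1
```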
Value function approximation replaces a huge table of values with a small set of parameters that can generalize across similar states.
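The simplest instance is a linear approximator: a value estimate is a dot product of a weight vector with state features, so two weights can cover what a table would store entry by entry. The feature map below is a hypothetical example:

```python
def features(state):
    """Hypothetical feature map: the raw state plus a bias term."""
    return [state, 1.0]

def v_hat(state, w):
    """Linear value approximation: v(s) ~ w . phi(s)."""
    return sum(wi * xi for wi, xi in zip(w, features(state)))

# Two parameters generalize across all real-valued states.
w = [0.5, 1.0]
print(v_hat(4.0, w))  # 0.5*4 + 1 = 3.0
```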
Proximal Policy Optimization (PPO) stabilizes policy gradient learning by preventing each update from moving the policy too far from the previous one.
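The mechanism is PPO's clipped surrogate objective: the probability ratio between new and old policies is clipped to [1-eps, 1+eps], so moving further than that earns no extra objective value. A per-sample sketch:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A ratio of 1.5 is clipped to 1.2: no incentive to move the
# policy further than 1+eps from the previous one.
print(ppo_clip_objective(1.5, advantage=1.0))  # 1.2, not 1.5
```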
Temporal Difference (TD) Learning updates value estimates by bootstrapping from the next state's current estimate, enabling fast, online learning.
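The one-step version, TD(0), can be sketched in a few lines; the states, rewards, and step size below are illustrative:

```python
def td0_update(v, s, r, s_next, alpha=0.1, gamma=0.99):
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    target = r + gamma * v[s_next]        # bootstraps from s' 's current estimate
    v[s] = v[s] + alpha * (target - v[s])
    return v

v = {"A": 0.0, "B": 10.0}
td0_update(v, "A", r=1.0, s_next="B", alpha=0.5, gamma=1.0)
print(v["A"])  # 0 + 0.5 * (1 + 10 - 0) = 5.5
```

Because the target uses the current estimate of the next state rather than a full return, the update can happen online, after every single step.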
A Markov Decision Process (MDP) models decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
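One common way to write an MDP down is a transition table mapping (state, action) to a distribution over (next state, reward) outcomes; the tiny example below is illustrative, with the action controlled by the agent and the outcome partly random:

```python
# (state, action) -> list of (probability, next_state, reward) outcomes.
mdp = {
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s0", "stay"): [(1.0, "s0", 0.0)],
}

def expected_reward(outcomes):
    """Expected one-step reward under the outcome distribution."""
    return sum(p * r for p, _, r in outcomes)

print(expected_reward(mdp[("s0", "go")]))  # 0.8*1.0 + 0.2*0.0 = 0.8
```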
The policy gradient theorem tells us how to push a stochastic policy's parameters to increase expected return by following the gradient of expected rewards.
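For a two-action softmax policy, the score-function (REINFORCE) estimate of that gradient has a closed form: (one-hot(action) - pi) scaled by the return. A sketch with illustrative parameters:

```python
import math

def softmax_probs(theta):
    """Policy over actions, parameterized by preferences theta."""
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [x / s for x in z]

def reinforce_grad(theta, action, ret):
    """grad log pi(action) * return = (one_hot(action) - pi) * return."""
    pi = softmax_probs(theta)
    return [((1.0 if i == action else 0.0) - p) * ret for i, p in enumerate(pi)]

# A positive return pushes probability mass toward the sampled action.
g = reinforce_grad([0.0, 0.0], action=0, ret=2.0)
print(g)  # [1.0, -1.0]: raise theta[0], lower theta[1]
```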
Bellman equations express how the value of a state or action equals immediate reward plus discounted value of what follows.
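The optimality form of that equation, as a single backup over an illustrative two-action state (reusing the (probability, next_state, reward) layout from above):

```python
def bellman_backup(values, transitions, gamma=0.9):
    """V(s) = max_a sum_{s'} P(s'|s,a) * (r + gamma * V(s'))."""
    return max(
        sum(p * (r + gamma * values[s_next]) for p, s_next, r in outcomes)
        for outcomes in transitions.values()
    )

# Illustrative state: 'go' pays 1 and reaches a valuable successor.
transitions = {
    "go":   [(1.0, "s1", 1.0)],
    "stay": [(1.0, "s0", 0.0)],
}
values = {"s0": 0.0, "s1": 10.0}
print(bellman_backup(values, transitions, gamma=0.5))  # max(1 + 0.5*10, 0) = 6.0
```

Applying this backup repeatedly to every state is exactly value iteration.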