๐ŸŽฎ Reinforcement Learning Mathematics

Mathematical foundations of RL: Markov decision processes, Bellman equations, policy gradients, and temporal difference methods.

9 concepts

Intermediate (8)

โˆ‘ Math · Intermediate

Markov Decision Processes (MDP)

A Markov Decision Process (MDP) models decision-making in situations where outcomes are partly random and partly under the control of a decision maker.

#markov decision process · #value iteration · #policy iteration · +12 more
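To make the definition concrete, here is a minimal value-iteration sketch on a made-up two-state, two-action MDP; the transition tensor `P` and reward table `R` are invented purely for illustration:

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') V(s')
    Q = R + gamma * P @ V          # shape (2, 2)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)          # greedy policy w.r.t. the converged values
```

Because the backup is a gamma-contraction, the loop converges to the unique optimal value function regardless of initialization.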
๐Ÿ“š Theory · Intermediate

Bellman Equations

Bellman equations express how the value of a state or action equals the immediate reward plus the discounted value of what follows.

#bellman equation · #value iteration · #policy iteration · +12 more
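For a fixed policy the Bellman expectation equation is linear, so small problems can be solved exactly. A sketch with an invented two-state transition matrix:

```python
import numpy as np

# Hypothetical Markov reward process under a fixed policy:
# P_pi[s, s'] = transition probability, r_pi[s] = expected reward.
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
r_pi = np.array([1.0, 0.0])
gamma = 0.9

# Bellman expectation equation in matrix form: v = r + gamma * P v
# => (I - gamma * P) v = r, solved directly as a linear system.
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
```

The solution is the fixed point of the Bellman operator: plugging `v` back into the right-hand side returns `v` itself.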
โš™๏ธAlgorithmIntermediate

Temporal Difference Learning

Temporal Difference (TD) Learning updates value estimates by bootstrapping from the next state's current estimate, enabling fast, online learning.

#temporal difference learning · #td(0) · #sarsa · +12 more
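A minimal TD(0) sketch on a hypothetical three-state chain (states and rewards are invented for illustration): state 0 leads to 1, which leads to the terminal state 2, paying +1 on termination.

```python
gamma, alpha = 1.0, 0.1
V = [0.0, 0.0, 0.0]   # V[2] stays 0 (terminal state)

for _ in range(1000):  # episodes
    s = 0
    while s != 2:
        s_next = s + 1
        r = 1.0 if s_next == 2 else 0.0
        # TD(0) update: bootstrap from the current estimate of the next state.
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
```

Both non-terminal values converge toward 1.0, the true return, without ever waiting for an episode to finish before updating.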
โš™๏ธAlgorithmIntermediate

PPO & Trust Region Methods

Proximal Policy Optimization (PPO) stabilizes policy gradient learning by preventing each update from moving the policy too far from the previous one.

#ppo · #trust region · #trpo · +11 more
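The clipped surrogate at the heart of PPO fits in a few lines; this sketch assumes scalar inputs and the common clip range ε = 0.2:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).

    ratio = pi_new(a|s) / pi_old(a|s); clipping removes the incentive
    to push the ratio outside [1-eps, 1+eps].
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)
```

Note the asymmetry: when the advantage is positive, gains beyond the clip range are flattened, but when it is negative, a large ratio is still fully penalized, which is what keeps the new policy close to the old one.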
๐Ÿ“š Theory · Intermediate

Value Function Approximation

Value function approximation replaces a huge table of values with a small set of parameters that can generalize across similar states.

#reinforcement learning · #value function approximation · #linear function approximator · +12 more
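A sketch of a linear approximator trained by stochastic gradient steps; the feature map `phi` and the target function (true V(s) = 2s + 1) are invented for illustration:

```python
import numpy as np

def phi(s):
    # Hypothetical feature map over a continuous state: bias + polynomial features.
    return np.array([1.0, s, s * s])

w = np.zeros(3)          # three parameters instead of a table over all states
alpha = 0.05

rng = np.random.default_rng(0)
for _ in range(2000):
    s = rng.uniform(-1, 1)
    target = 2 * s + 1               # illustrative supervised target
    v_hat = w @ phi(s)               # V_hat(s) = w . phi(s)
    # Gradient of the squared error 0.5*(target - v_hat)^2 w.r.t. w is -(target - v_hat)*phi(s).
    w += alpha * (target - v_hat) * phi(s)
```

The weights converge toward (1, 2, 0), recovering the target function while generalizing to states never sampled.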
๐Ÿ“š Theory · Intermediate

Exploration-Exploitation Tradeoff

The explorationโ€“exploitation tradeoff is the tension between trying new actions to learn (exploration) and using the best-known action to earn rewards now (exploitation).

#multi-armed bandit · #exploration exploitation · #ucb1 · +12 more
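A minimal UCB1 sketch on an invented two-armed Bernoulli bandit (success probabilities 0.3 and 0.7); the bonus term shrinks as an arm is pulled more, shifting play from exploration to exploitation:

```python
import math
import random

random.seed(0)
probs = [0.3, 0.7]        # hypothetical arm payout probabilities
counts = [0, 0]
values = [0.0, 0.0]       # running mean reward per arm

for t in range(1, 2001):
    if 0 in counts:
        a = counts.index(0)            # play each arm once first
    else:
        # UCB1: empirical mean + exploration bonus sqrt(2 ln t / n_a)
        ucb = [values[i] + math.sqrt(2 * math.log(t) / counts[i])
               for i in range(2)]
        a = ucb.index(max(ucb))
    r = 1.0 if random.random() < probs[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental mean update
```

After enough pulls, the better arm dominates the play counts while the worse arm still receives occasional exploratory pulls.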
๐Ÿ“š Theory · Intermediate

RLHF Mathematics

RLHF (reinforcement learning from human feedback) turns human preferences between two model outputs into training signals using a probabilistic model of choice such as Bradley–Terry.

#rlhf · #bradley-terry · #pairwise comparisons · +11 more
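Under the Bradley–Terry model, the probability that one output beats another is a sigmoid of their reward difference, and the reward model is trained to maximize the likelihood of the observed preferences. A sketch (function names are ours):

```python
import math

def bradley_terry_prob(r_chosen, r_rejected):
    """P(chosen beats rejected) = sigmoid(r_chosen - r_rejected),
    where r_* are scalar reward-model scores."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def preference_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the observed human preference;
    minimizing it pushes the chosen score above the rejected one."""
    return -math.log(bradley_terry_prob(r_chosen, r_rejected))
```

Equal scores give probability 1/2 (indifference), and the loss falls monotonically as the margin between chosen and rejected scores grows.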
โˆ‘ Math · Intermediate

Discount Factor & Return

The discounted return G_t sums all future rewards but down-weights distant rewards by powers of a discount factor ฮณ.

#discount factor · #discounted return · #reinforcement learning · +12 more
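The definition G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … is most easily computed with the backward recursion G_t = r_{t+1} + γ G_{t+1}:

```python
def discounted_return(rewards, gamma):
    """Discounted return of a finite reward sequence, computed
    backwards via G_t = r_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# With gamma = 0.5, rewards [1, 1, 1] give 1 + 0.5 + 0.25 = 1.75.
```

Smaller γ makes the agent myopic (distant rewards vanish quickly); γ close to 1 makes it far-sighted.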

Advanced (1)

๐Ÿ“š Theory · Advanced

Policy Gradient Theorem

The policy gradient theorem gives a tractable expression for the gradient of expected return with respect to a stochastic policy's parameters, so the policy can be improved directly by gradient ascent.

#policy gradient · #reinforce · #actor-critic · +11 more
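A bare-bones REINFORCE sketch on an invented two-armed bandit with a Bernoulli policy π(a=1) = sigmoid(θ); arm 1 always pays 1 and arm 0 pays 0, and all constants are illustrative:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
theta, alpha = 0.0, 0.1
for _ in range(2000):
    p = sigmoid(theta)
    a = 1 if random.random() < p else 0
    reward = float(a)                    # arm 1 is strictly better
    # For a Bernoulli policy, d/dtheta log pi(a) = a - p.
    theta += alpha * reward * (a - p)    # REINFORCE: reward * grad log pi
```

Because the update follows the sampled gradient of expected reward, θ drifts upward and the policy concentrates on the better arm.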