Groups
Category
The reparameterization trick rewrites a random variable as a deterministic function of noise that does not depend on the parameters, such as z = ฮผ + ฯ ยท ฮต with ฮต ~ N(0, 1).
The policy gradient theorem tells us how to push a stochastic policyโs parameters to increase expected return by following the gradient of expected rewards.