Groups
RLHF turns human preferences between two model outputs into training signals using a probabilistic model of choice.
The explorationโexploitation tradeoff is the tension between trying new actions to learn (exploration) and using the best-known action to earn rewards now (exploitation).