Reinforcement learning from human feedback (RLHF) is a training technique that uses direct human input to teach an AI model what good output looks like. Rather than optimizing a hand-specified, mathematically defined objective, RLHF builds a reward model from human preference data and uses it to guide the model's behavior through reinforcement learning.
RLHF typically follows four stages:
- The process starts with an already pretrained base model rather than training from scratch.
- Human-labeled demonstrations (prompt-response pairs) are used to fine-tune the model with supervised learning, aligning its output format with user expectations.
- Human evaluators compare and rank pairs of model responses, generating the preference data used to train a reward model (see the first sketch after this list).
- The reward model then guides policy updates through proximal policy optimization (PPO), which constrains each update, typically with a clipped objective and a penalty for drifting from the original model, so the model cannot change so drastically that it starts producing nonsensical output (see the second sketch after this list).
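The reward model is commonly trained on these comparisons with a pairwise (Bradley-Terry-style) loss that pushes the preferred response's score above the rejected one's. The following is a minimal PyTorch sketch of that idea, not a production implementation: the `RewardModel` here is a toy bag-of-tokens scorer (a real RLHF reward model is typically a full language model with a scalar head), and the random tensors stand in for tokenized responses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: embeds token IDs, mean-pools, and maps to a scalar score.
    A real RLHF reward model would be a full language model with a scalar head."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)
        return self.score(pooled).squeeze(-1)     # (batch,) scalar rewards

def pairwise_preference_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry loss: push the chosen response's score above the rejected one's."""
    r_chosen = reward_model(chosen_ids)
    r_rejected = reward_model(rejected_ids)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Minimal usage with random "tokenized" responses (hypothetical data).
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen = torch.randint(0, 1000, (8, 32))    # 8 preferred responses, 32 tokens each
rejected = torch.randint(0, 1000, (8, 32))  # 8 dispreferred responses
loss = pairwise_preference_loss(model, chosen, rejected)
loss.backward()
optimizer.step()
```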
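For the PPO stage, the sketch below shows the core of the update under simplifying assumptions: per-token log-probabilities and advantage estimates are taken as given tensors, and the value function, rollout collection, and advantage estimation that a full PPO implementation needs are omitted. The clipped surrogate term limits how far a single update can move the policy, while the KL term keeps the policy close to the original supervised model (in practice this penalty is often folded into the per-token reward rather than added to the loss, as is done here for brevity).

```python
import torch

def ppo_step_loss(logprobs_new, logprobs_old, logprobs_ref, advantages,
                  clip_eps=0.2, kl_coef=0.1):
    """PPO clipped surrogate loss plus a KL penalty toward the reference (SFT) model.

    logprobs_new : log pi_theta(token)  for the current policy (requires grad)
    logprobs_old : log pi_old(token)    from the policy that generated the rollout
    logprobs_ref : log pi_ref(token)    from the frozen supervised model
    advantages   : estimated advantage for each token
    """
    # Probability ratio between the updated policy and the rollout policy.
    ratio = torch.exp(logprobs_new - logprobs_old)

    # Clipped surrogate objective: take the more pessimistic of the two terms,
    # which stops any single update from moving the policy too far.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Approximate KL penalty that discourages drifting away from the original
    # model, which is what guards against degenerate, nonsensical outputs.
    approx_kl = (logprobs_new - logprobs_ref).mean()
    return policy_loss + kl_coef * approx_kl

# Hypothetical per-token tensors for an 8-sequence, 32-token batch.
new_lp = torch.randn(8, 32, requires_grad=True)
old_lp = new_lp.detach() + 0.01 * torch.randn(8, 32)
ref_lp = new_lp.detach() + 0.05 * torch.randn(8, 32)
adv = torch.randn(8, 32)
loss = ppo_step_loss(new_lp, old_lp, ref_lp, adv)
loss.backward()
```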
RLHF is central to the development of modern large language models, including OpenAI's GPT series. It has helped improve instruction-following and reduce hallucinations. It is also closely related to instruction tuning, which uses supervised learning on labeled instruction-response pairs to achieve similar alignment goals.
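By way of comparison, instruction tuning (like the supervised fine-tuning stage above) comes down to ordinary next-token cross-entropy on instruction-response pairs. A minimal sketch, assuming a model that maps token IDs to per-position logits; the `ToyLM` below is a hypothetical stand-in for a real causal language model:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Supervised fine-tuning loss: next-token cross-entropy on the response,
    conditioned on the prompt. `model` maps token IDs to logits of shape
    (batch, seq_len, vocab_size)."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    logits = model(input_ids)
    prompt_len = prompt_ids.shape[1]
    # Each position t predicts the token at t+1; keep only the positions that
    # predict response tokens, so the prompt itself is not part of the loss.
    shift_logits = logits[:, prompt_len - 1:-1, :]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.shape[-1]),
                           response_ids.reshape(-1))

# Toy stand-in "language model": embedding plus a linear head over the vocab.
class ToyLM(torch.nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)
    def forward(self, ids):
        return self.head(self.embed(ids))

model = ToyLM()
prompts = torch.randint(0, 1000, (4, 16))    # 4 instructions, 16 tokens each
responses = torch.randint(0, 1000, (4, 24))  # 4 labeled responses, 24 tokens each
sft_loss(model, prompts, responses).backward()
```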
The main limitations of RLHF are the cost of collecting human feedback, the subjectivity of annotator judgments, and the risk of overfitting the model to a narrow set of human preferences.