In machine learning, reinforcement learning from human feedback (RLHF), also known as reinforcement learning from human preferences, is a technique that trains a "reward model" directly from human feedback and uses that model as the reward function to optimize an agent's policy with reinforcement learning (RL), via an optimization algorithm such as Proximal Policy Optimization.[1][2] The reward model is trained in advance of the policy being optimized, and learns to predict whether a given output is good (high reward) or bad (low reward). RLHF can improve the robustness and exploration of RL agents, especially when the reward function is sparse or noisy.[3]
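The pipeline can be illustrated with a minimal, self-contained toy sketch in NumPy. The five discrete "responses", the simulated annotator, and the plain REINFORCE update used here in place of PPO are simplifications introduced for illustration only, not the procedure used for large language models.

```python
import numpy as np

rng = np.random.default_rng(0)
n_responses = 5
true_quality = np.array([0.1, 0.5, 0.2, 0.9, 0.3])  # hidden preferences of the simulated human

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# 1. Collect pairwise comparisons from a simulated human annotator.
comparisons = []
for _ in range(500):
    a, b = rng.choice(n_responses, size=2, replace=False)
    winner, loser = (a, b) if true_quality[a] > true_quality[b] else (b, a)
    comparisons.append((winner, loser))

# 2. Fit a reward model (one scalar per response) by gradient ascent on the
#    Bradley-Terry log-likelihood of the observed comparisons.
reward = np.zeros(n_responses)
for _ in range(200):
    grad = np.zeros(n_responses)
    for w, l in comparisons:
        p_w = 1.0 / (1.0 + np.exp(-(reward[w] - reward[l])))  # P(w preferred over l)
        grad[w] += 1.0 - p_w
        grad[l] -= 1.0 - p_w
    reward += 0.05 * grad / len(comparisons)

# 3. Optimize the policy against the learned reward with REINFORCE updates.
logits = np.zeros(n_responses)
for _ in range(2000):
    probs = softmax(logits)
    action = rng.choice(n_responses, p=probs)
    advantage = reward[action] - probs @ reward        # reward minus its expectation
    grad_log_prob = -probs
    grad_log_prob[action] += 1.0
    logits += 0.1 * advantage * grad_log_prob

print("learned rewards:", np.round(reward, 2))
print("policy's favourite response:", int(softmax(logits).argmax()))  # should favour index 3
```

The key structural point the sketch preserves is that the policy never sees the human directly: the humans' comparisons are distilled into the reward model once, and the RL step then optimizes against that proxy.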
Human feedback is most commonly collected by asking humans to rank instances of the agent's behavior.[4][5][6] These rankings can then be used to score outputs, for example with the Elo rating system.[2] While preference judgements are the most widely adopted form of feedback, other types of human feedback provide richer information, such as numerical feedback, natural-language feedback, and edit rate.[7]
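As an illustration (not taken from the cited works), the hypothetical helper below applies one Elo update per pairwise comparison; after repeated comparisons each output carries a scalar rating that reflects how often, and against what, it was preferred.

```python
def elo_update(rating_winner, rating_loser, k=32):
    """Return updated (winner, loser) Elo ratings after one pairwise comparison."""
    expected_win = 1.0 / (1.0 + 10 ** ((rating_loser - rating_winner) / 400))
    rating_winner += k * (1.0 - expected_win)
    rating_loser -= k * (1.0 - expected_win)
    return rating_winner, rating_loser

# Example: output A is preferred over output B three times in a row.
a, b = 1000.0, 1000.0
for _ in range(3):
    a, b = elo_update(a, b)
print(round(a), round(b))  # 1044 956
```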
Standard RLHF assumes that human preferences follow a Bradley-Terry model for pairwise comparisons (or a Plackett-Luce model for comparisons among more than two options) and learns a reward model by minimizing the resulting cross-entropy loss.[8] After the reward model is learned, RLHF further fine-tunes the language model against it, aligning the model with human preferences.
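In the standard formulation, for a prompt x with a human-preferred completion y_w and a dispreferred completion y_l drawn from a comparison dataset D, the reward model r_θ is trained by minimizing the negative log-likelihood of the preferences under the Bradley-Terry model (σ denotes the logistic sigmoid):

```latex
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\bigl( r_\theta(x, y_w) - r_\theta(x, y_l) \bigr) \right]
```

This is the binary cross-entropy loss for the event "y_w is preferred to y_l", whose probability the Bradley-Terry model takes to be σ(r_θ(x, y_w) − r_θ(x, y_l)).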
RLHF is used in tasks where it's difficult to define a clear, algorithmic solution but where humans can easily judge the quality of the model's output. For example, if the task is to generate a compelling story, humans can rate different AI-generated stories on their quality, and the model can use their feedback to improve its story generation skills.
RLHF has been applied to various domains of natural language processing, such as conversational agents, text summarization, and natural language understanding.[9] Ordinary reinforcement learning, where agents learn from their own actions based on a "reward function", is difficult to apply to natural language processing tasks because the rewards are often not easy to define or measure, especially when dealing with complex tasks that involve human values or preferences. RLHF can enable language models to provide answers that align with these complex values, to generate more verbose responses, and to reject questions that are either inappropriate or outside the knowledge space of the model.[10] Some examples of RLHF-trained language models are OpenAI's ChatGPT and its predecessor InstructGPT,[5][11] as well as DeepMind's Sparrow.[12]
RLHF has also been applied to other areas, such as the development of video game bots. For example, OpenAI and DeepMind trained agents to play Atari games based on human preferences.[13][14] The agents achieved strong performance in many of the environments tested, often surpassing human performance.[15]
RLHF suffers from a number of challenges that can be broken down into problems with human feedback, problems with learning a reward model, and problems with optimizing the policy.[16]
One major challenge is the scalability and cost of human feedback, which can be slow and expensive compared to unsupervised learning. The quality and consistency of human feedback can also vary depending on the task, the interface, and the individual preferences of the humans. Even when human feedback is feasible, RLHF models may still exhibit undesirable behaviors that are not captured by human feedback or exploit loopholes in the reward model, which brings to light the challenges of alignment and robustness.[17]
The effectiveness of RLHF is dependent on the quality of human feedback.[18] If the feedback lacks impartiality or is inconsistent or incorrect, the model may become biased.[19] There is also a risk that the model may overfit to the feedback it receives. For instance, if feedback comes predominantly from a specific demographic or if it reflects specific biases, the model may learn not only the general alignment intended in the feedback, but also any peculiarities or noise present therein.[20][21] This excessive alignment to the specific feedback it received (or to the biases of the specific demographic that provided it) can lead to the model performing suboptimally in new contexts or when used by different groups.
Additionally, in some cases, there may be a risk of the model learning to manipulate the feedback process or game the system to achieve higher rewards, rather than genuinely improving its performance, which indicates a fault in the reward function.[22]
Researchers have surveyed a number of additional limitations to RLHF.[23]
Alternatives
An alternative to RLHF called Direct Preference Optimization (DPO) was described in 2023.[24] Like RLHF, it is used to improve pre-trained large language models using human-generated preference data. Unlike RLHF, it does not train an intermediate reward model and does not use reinforcement learning; instead, it uses a change of variables to express the reward implicitly in terms of the policy itself, and fine-tunes the large language model directly on the preference data with a simple classification-style loss.
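A minimal sketch of the resulting per-pair objective is shown below (illustrative code, not the authors' implementation; the function name and the log-probability values in the example are made up). The inputs are the summed log-probabilities of the preferred and dispreferred completions under the policy being trained and under a frozen reference model, and β controls how far the policy may drift from the reference.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Negative log-sigmoid DPO loss for a single preference pair."""
    chosen_margin = beta * (logp_chosen - ref_logp_chosen)
    rejected_margin = beta * (logp_rejected - ref_logp_rejected)
    # The loss is low when the policy raises the preferred completion's
    # likelihood relative to the reference more than the dispreferred one's.
    return -math.log(1.0 / (1.0 + math.exp(-(chosen_margin - rejected_margin))))

# Example: the policy already favours the chosen completion slightly more than
# the reference model does, so the loss falls below log(2) ≈ 0.693.
print(dpo_loss(-12.0, -15.0, -12.5, -14.8, beta=0.1))
```

Minimizing this loss over a dataset of comparisons plays the role that reward-model training and PPO fine-tuning play together in RLHF.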
References
Ziegler, Daniel M.; Stiennon, Nisan; Wu, Jeffrey; Brown, Tom B.; Radford, Alec; Amodei, Dario; Christiano, Paul; Irving, Geoffrey (2019). "Fine-Tuning Language Models from Human Preferences". arXiv:1909.08593 [cs.CL].
Warnell, Garrett; Waytowich, Nicholas; Lawhern, Vernon; Stone, Peter (25 April 2018). "Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces". Proceedings of the AAAI Conference on Artificial Intelligence, 32 (1). doi:10.1609/aaai.v32i1.11485.
Fernandes, Patrick; Madaan, Aman; Liu, Emmy; Farinhas, António; Martins, Pedro Henrique; Bertsch, Amanda; de Souza, José G. C.; Zhou, Shuyan; Wu, Tongshuang; Neubig, Graham; Martins, André F. T. (2023). "Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation". arXiv:2305.00955 [cs.CL].
Rafailov, Rafael; Sharma, Archit; Mitchell, Eric; Ermon, Stefano; Manning, Christopher D.; Finn, Chelsea (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". arXiv:2305.18290 [cs.LG].