ChatGPT
We solemnly promise that in the writing of this course we have not made use of ChatGPT (or not too much)
ChatGPT and its sibling model, InstructGPT, are both designed to address that problem and to be better aligned with their users through RLHF.
ChatGPT has been fine-tuned with a combination of supervised learning and reinforcement learning.
Specifically, ChatGPT uses Reinforcement Learning from Human Feedback (RLHF), which incorporates human feedback into the training loop to improve dialogue interaction and to minimize harmful, untruthful, and biased outputs.
During the reinforcement learning stage, AI trainers rank the responses produced by the model.
Periodically, a human ranker receives two ChatGPT answers to the same prompt and decides which one better suits the task at hand.
In parallel, the agent builds a model of the task's goal and refines it with RL.
As it learns the desired behaviour, the agent starts asking for human feedback only on the outputs it is most uncertain about, further refining its understanding.
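To make the ranking setup concrete, each collected comparison can be thought of as a simple record like the hypothetical one below; the field names and example strings are illustrative, not OpenAI's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    """One human comparison between two model answers to the same prompt.
    Field names are illustrative, not OpenAI's actual data format."""
    prompt: str
    answer_a: str
    answer_b: str
    preferred: str  # "a" or "b", chosen by the human ranker

# Example record produced by a single ranking session
record = Comparison(
    prompt="Explain RLHF in one sentence.",
    answer_a="RLHF fine-tunes a language model using human preference rankings.",
    answer_b="RLHF is a kind of database index.",
    preferred="a",
)
print(record.preferred)
```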
The core training approach of ChatGPT relies on a "human-annotated data + reinforcement learning" method (RLHF, Reinforcement Learning from Human Feedback).
The main idea of RLHF is to continuously fine-tune the underlying language model so that it understands the intent behind human commands.
ChatGPT differs slightly in its data collection setup: for supervised fine-tuning, human AI trainers wrote conversations in which they played both sides, the user and the AI assistant.
The core ChatGPT training process is split into three main phases:
The objective of the first phase is to fine-tune the GPT-3.5 policy to understand a specific set of user commands. During this phase, users submit batches of prompts and human labelers write high-quality answers for them. The resulting <prompt, answer> dataset is then used to fine-tune GPT-3.5 so that it better understands the actions requested in the prompts.
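The sketch below illustrates this supervised fine-tuning step under some simplifying assumptions: GPT-3.5 is not publicly available, so gpt2 stands in for the base model, and a single toy <prompt, answer> pair replaces the labeler-written dataset.

```python
# Minimal sketch of the supervised fine-tuning (SFT) phase.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")       # stand-in for GPT-3.5
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy <prompt, answer> pair; in practice these come from human labelers.
prompt = "Explain what a reward model is."
answer = "A reward model scores how good a response is according to human preferences."

# Concatenate prompt and answer; the model learns to predict the next token.
# (Real pipelines often mask the prompt tokens out of the loss.)
inputs = tokenizer(prompt + " " + answer, return_tensors="pt")
labels = inputs["input_ids"].clone()

model.train()
outputs = model(**inputs, labels=labels)  # causal language-modeling loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"SFT loss: {outputs.loss.item():.4f}")
```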
The goal of the second phase is to train a reward model on annotated data. During this phase, ChatGPT samples a batch of prompts from the first phase and generates several different answers for each prompt.
Given the resulting <prompt, answer1, answer2, …, answerN> tuples, an annotator orders the answers according to multidimensional criteria that include relevance, informativeness, harmfulness, and several others. The resulting dataset is used to train the reward model.
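A common way to train such a reward model is a pairwise ranking loss over answers the annotator preferred versus answers they ranked lower. The sketch below shows that objective with a tiny placeholder scorer and random embeddings standing in for a real transformer with a scalar reward head.

```python
# Sketch of the reward-model objective: for each (better, worse) pair drawn from
# the annotator's ranking, push the score of the better answer above the worse one.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scorer = torch.nn.Linear(16, 1)            # placeholder for a transformer + scalar head
optimizer = torch.optim.AdamW(scorer.parameters(), lr=1e-4)

# Fake embeddings of (prompt + answer) pairs; real ones come from the language model.
better = torch.randn(8, 16)                # answers ranked higher by the annotator
worse = torch.randn(8, 16)                 # answers ranked lower

r_better = scorer(better).squeeze(-1)      # scalar reward per answer
r_worse = scorer(worse).squeeze(-1)

# Pairwise ranking loss: -log sigmoid(r_better - r_worse)
loss = -F.logsigmoid(r_better - r_worse).mean()
loss.backward()
optimizer.step()
print(f"ranking loss: {loss.item():.4f}")
```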
The final phase of ChatGPT training applies a reinforcement learning (RL) method to enhance the pretrained model. The RL algorithm uses the reward model from the previous phase to update the parameters of the pretrained model.
Specifically, this phase initializes a batch of new commands sampled from the prompts as well as the parameters of the proximal policy optimization (PPO) model. For each prompt, the PPO model generates answers, and the reward model scores them; those scores drive the policy update.
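The sketch below illustrates the kind of update performed in this phase: a clipped PPO objective in which the reward combines the phase-two reward-model score with a KL penalty that keeps the policy close to the supervised fine-tuned model. All tensors, the beta coefficient, and the clip range are illustrative stand-ins, not the exact values or implementation used for ChatGPT.

```python
# Minimal sketch of an RLHF-style PPO update with dummy tensors.
import torch

torch.manual_seed(0)
beta, clip_eps = 0.02, 0.2                       # illustrative hyperparameters

logp_new = torch.randn(8, requires_grad=True)          # log-probs under the current policy
logp_old = logp_new.detach() + 0.1 * torch.randn(8)    # log-probs at sampling time
logp_ref = logp_old + 0.1 * torch.randn(8)             # log-probs under the frozen SFT model
rm_score = torch.randn(8)                               # scores from the phase-two reward model

# Penalize drifting away from the SFT policy, then center to get advantages.
reward = rm_score - beta * (logp_old - logp_ref)
advantage = reward - reward.mean()

# Clipped PPO objective on the probability ratio new/old.
ratio = torch.exp(logp_new - logp_old)
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
ppo_loss = -torch.min(unclipped, clipped).mean()

ppo_loss.backward()
print(f"PPO loss: {ppo_loss.item():.4f}")
```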
The approach behind ChatGPT combines pretrained language models with carefully designed supervised fine-tuning and reinforcement learning processes to deliver a level of instruction understanding we have not seen before in this type of model.