Reinforcement Learning from Human Feedback (RLHF) is a popular approach used to align AI systems with human preferences by training them with feedback from people, rather than relying solely on predefined reward functions. Instead of manually coding every desirable behavior (which is often infeasible in complex tasks), RLHF allows models, especially large language models (LLMs), to learn from examples of what humans consider good or bad outputs. This approach is particularly important for tasks where success is subjective or hard to quantify, such as generating helpful and safe text responses. RLHF has become a cornerstone of building more aligned and controllable AI systems, making it essential for developing AI that behaves the way humans intend.
This blog dives into the full training pipeline of the RLHF framework. We'll explore each stage, from data generation and reward model inference to the final training of an LLM. Our goal is to make everything fully reproducible by providing all the necessary code and the exact specifications of the environments used. By the end of this post, you should know the general pipeline to train any model on any instruction dataset with the RLHF algorithm of your choice!
Preliminary: Setup & Environment
We'll use the following setup for this tutorial:
- Dataset: UltraFeedback, a well-curated dataset consisting of general chat prompts. (While UltraFeedback also contains LLM-generated responses to the prompts, we won't be using those.)
- Base Model: Llama-3-8B-it, a state-of-the-art instruction-tuned LLM. This is the model we will fine-tune.
- Reward Model: Armo, a strong reward model optimized for evaluating generated outputs. We'll use Armo to assign scalar reward values to candidate responses, indicating how "good" or "aligned" a response is.
- Training Algorithm: REBEL, a state-of-the-art algorithm tailored for efficient RLHF optimization.
To get started, clone our repo, which contains all the resources required for this tutorial:
git clone https://github.com/ZhaolinGao/REBEL
cd REBEL
We use two separate environments for different stages of the pipeline:
- vllm: Handles data generation, leveraging the efficient vllm library.
- rebel: Used for training the RLHF model.
You can install both environments using the provided YAML files:
conda env create -f ./envs/rebel_env.yml
conda env create -f ./envs/vllm_env.yml
Part 1: Data Generation
The first step in the RLHF pipeline is generating samples from the policy to receive feedback on. Concretely, in this section, we will load the base model with vllm for fast inference, prepare the dataset, and generate multiple responses for each prompt in the dataset. The complete code for this part is available here.
Activate the vllm environment:
conda activate vllm
First, load the base model and tokenizer using vllm:
from transformers import AutoTokenizer
from vllm import LLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=8,
)
Here, tensor_parallel_size specifies the number of GPUs to use.
Next, load the UltraFeedback dataset:
from datasets import load_dataset

dataset = load_dataset("allenai/ultrafeedback_binarized_cleaned_train", split="train")
You can select a subset of the dataset using dataset.select. For example, to select the first 10,000 rows:
dataset = dataset.select(range(10000))
Alternatively, you can split the dataset into chunks using dataset.shard for implementations like SPPO, where each iteration trains on only one of the chunks (a small example is sketched below).
Now, let's prepare the dataset for generation. The Llama model uses special tokens to distinguish prompts from responses. For example:
<|begin_of_text|><|start_header_id|>user<|end_header_id|> What is France's capital?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Therefore, for every prompt in the dataset, we need to convert it from plain text into this format before generating:
def get_message(instruction):
    message = [
        {"role": "user", "content": instruction},
    ]
    return message

prompts = [tokenizer.apply_chat_template(get_message(row['prompt']), tokenize=False, add_generation_prompt=True) for row in dataset]
- get_message transforms the plain-text prompt into a dictionary indicating it comes from the user.
- tokenizer.apply_chat_template adds the required special tokens and appends the generation prompt (<|start_header_id|>assistant<|end_header_id|>\n\n) at the end when add_generation_prompt=True.
Finally, we can generate the responses using vllm with the prompts we just formatted. We are going to generate 5 responses per prompt:
import torch
import random
import numpy as np
from vllm import SamplingParams

def set_seed(seed=5775709):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for p in range(5):
    set_seed(p * 50)
    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.9,
        max_tokens=2048,
        seed=p * 50,
    )
    response = llm.generate(prompts, sampling_params)
    output = list(map(lambda x: x.outputs[0].text, response))
    dataset = dataset.add_column(f"response_{p}", output)
- temperature=0.8, top_p=0.9 are common settings to control diversity during generation.
- set_seed is used to ensure reproducibility and sets a different seed for each response.
- llm.generate generates the responses, and the results are added to the dataset with dataset.add_column.
You can run the complete script with:
python ./src/ultrafeedback_largebatch/generate.py --world_size NUM_GPU --output_repo OUTPUT_REPO
Part 2: Reward Model Inference
The second step in the RLHF pipeline is querying the reward model to tell us how good a generated sample is. Concretely, in this part, we will calculate reward scores for the responses generated in Part 1, which are later used for training. The complete code for this part is available here.
Activate the rebel environment:
conda activate rebel
To begin, we'll initialize the Armo reward model pipeline. This reward model is a fine-tuned sequence classification model that assigns a scalar reward score to a given dialogue based on its quality.
rm = ArmoRMPipeline("RLHFlow/ArmoRM-Llama3-8B-v0.1", trust_remote_code=True)
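ArmoRMPipeline is defined in the repo's script. As a rough idea of what such a wrapper does, here is a minimal sketch assuming the standard ArmoRM usage with AutoModelForSequenceClassification; the actual class in the repo may differ in its details:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class ArmoRMPipeline:
    # Minimal sketch of a reward-model wrapper; the repo's version may differ.
    def __init__(self, model_id, trust_remote_code=False, device="cuda", max_length=4096):
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_id,
            device_map=device,
            trust_remote_code=trust_remote_code,
            torch_dtype=torch.bfloat16,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.max_length = max_length

    def __call__(self, messages):
        # Tokenize the (user, assistant) dialogue and return a scalar reward.
        input_ids = self.tokenizer.apply_chat_template(
            messages,
            return_tensors="pt",
            truncation=True,
            max_length=self.max_length,
        ).to(self.model.device)
        with torch.no_grad():
            output = self.model(input_ids)
        return output.score.float().item()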
Now, we can gather the reward scores:
def get_message(instruction, response):
    return [{"role": "user", "content": instruction}, {"role": "assistant", "content": response}]

rewards = {}
for i in range(5):
    rewards[f"response_{i}_reward"] = []
    for row in dataset:
        reward = rm(get_message(row['prompt'], row[f'response_{i}']))
        rewards[f"response_{i}_reward"].append(reward)

for k, v in rewards.items():
    dataset = dataset.add_column(k, v)
- get_message formats the user prompt and assistant response into a list of dictionaries.
- rm computes a reward score for each response in the dataset.
You can run the complete script with:
python ./src/ultrafeedback_largebatch/rank.py --input_repo INPUT_REPO
INPUT_REPO is the saved repo from Part 1 that contains the generated responses.
Part 3: Filter and Tokenize
While the previous two parts are all we need in theory to do RLHF, in practice it is often advisable to perform a filtering step so that training runs smoothly. Concretely, in this part, we'll walk through the process of preparing a dataset for training by filtering out excessively long prompts and responses to prevent out-of-memory (OOM) issues, selecting the best and worst responses for training, and removing duplicate responses. The complete code for this part is available here.
Let's first initialize two different tokenizers, one that pads from the right and one that pads from the left:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
tokenizer_left = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", padding_side="left")
tokenizer_left.add_special_tokens({"pad_token": "[PAD]"})
These two tokenizers allow us to pad the prompt from the left and the response from the right so that they meet in the middle. By combining left-padded prompts with right-padded responses, we ensure that:
- Prompts and responses meet at a consistent position.
- Relative position embeddings remain correct for model training.
Here's an example format:
[PAD] ... [PAD] <|begin_of_text|><|start_header_id|>user<|end_header_id|> PROMPT<|eot_id|><|start_header_id|>assistant<|end_header_id|> RESPONSE<|eot_id|>[PAD] ... [PAD]
We want to ensure that the length of
[PAD] ... [PAD] <|begin_of_text|><|start_header_id|>user<|end_header_id|> PROMPT<|eot_id|><|start_header_id|>assistant<|end_header_id|>
is the same for all prompts, and the length of
RESPONSE<|eot_id|>[PAD] ... [PAD]
is the same for all responses.
We filter out prompts longer than 1,024 tokens and responses exceeding 2,048 tokens to prevent OOM during training:
dataset = dataset.filter(lambda row: tokenizer.apply_chat_template(get_message(row['prompt']), tokenize=True, add_generation_prompt=True, return_tensors="pt").shape[-1] <= 1024)
for i in range(5):
    dataset = dataset.filter(lambda row: tokenizer.apply_chat_template(get_message(response=row[f'response_{i}']), tokenize=True, add_generation_prompt=False, return_tensors="pt")[:, 5:].shape[-1] <= 2048)
Note that we skip the first 5 tokens of responses when counting lengths, to exclude the special tokens at the start (e.g. <|begin_of_text|><|start_header_id|>assistant<|end_header_id|>\n\n) and only count the actual length of the response plus the EOS token (<|eot_id|>) at the end.
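As a quick illustration of where that number 5 comes from (assuming the Llama-3 tokenizer loaded above; the exact printed strings may look slightly different):

# The first 5 tokens of an assistant-only chat template are header tokens,
# not part of the response itself, which is why they are sliced off above.
ids = tokenizer.apply_chat_template(
    [{"role": "assistant", "content": "Paris."}],
    tokenize=True,
    add_generation_prompt=False,
)
print(tokenizer.convert_ids_to_tokens(ids[:5]))
# Roughly: ['<|begin_of_text|>', '<|start_header_id|>', 'assistant', '<|end_header_id|>', '\n\n']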
Now we can tokenize the prompts with left padding to a maximum length of 1,024 tokens:
llama_prompt_tokens = []
for row in dataset:
    llama_prompt_token = tokenizer_left.apply_chat_template(
        get_message(row['prompt']),
        add_generation_prompt=True,
        tokenize=True,
        padding='max_length',
        max_length=1024,
    )
    assert len(llama_prompt_token) == 1024
    assert (llama_prompt_token[0] == 128000 or llama_prompt_token[0] == 128256) and llama_prompt_token[-1] == 271
    llama_prompt_tokens.append(llama_prompt_token)
dataset = dataset.add_column("llama_prompt_tokens", llama_prompt_tokens)
The assertions ensure that the length is always 1,024 and that the tokenized prompt starts with either the [PAD] token or the <|begin_of_text|> token and ends with the \n\n token.
Then, we select the responses with the highest and lowest rewards for each prompt as the chosen and reject responses, and tokenize them with right padding:
chosen, reject, llama_chosen_tokens, llama_reject_tokens, chosen_reward, reject_reward = [], [], [], [], [], []

for row in dataset:
    all_rewards = [row[f"response_{i}_reward"] for i in range(5)]
    chosen_idx, reject_idx = np.argmax(all_rewards), np.argmin(all_rewards)

    chosen.append(row[f"response_{chosen_idx}"])
    reject.append(row[f"response_{reject_idx}"])

    llama_chosen_token = tokenizer.apply_chat_template(
        get_message(response=row[f"response_{chosen_idx}"]),
        add_generation_prompt=False,
        tokenize=True,
        padding='max_length',
        max_length=2048+5,
    )[5:]
    llama_chosen_tokens.append(llama_chosen_token)
    chosen_reward.append(row[f"response_{chosen_idx}_reward"])
    assert len(llama_chosen_token) == 2048
    assert llama_chosen_token[-1] == 128009 or llama_chosen_token[-1] == 128256

    llama_reject_token = tokenizer.apply_chat_template(
        get_message(response=row[f"response_{reject_idx}"]),
        add_generation_prompt=False,
        tokenize=True,
        padding='max_length',
        max_length=2048+5,
    )[5:]
    llama_reject_tokens.append(llama_reject_token)
    reject_reward.append(row[f"response_{reject_idx}_reward"])
    assert len(llama_reject_token) == 2048
    assert llama_reject_token[-1] == 128009 or llama_reject_token[-1] == 128256

dataset = dataset.add_column("chosen", chosen)
dataset = dataset.add_column("chosen_reward", chosen_reward)
dataset = dataset.add_column("llama_chosen_tokens", llama_chosen_tokens)
dataset = dataset.add_column("reject", reject)
dataset = dataset.add_column("reject_reward", reject_reward)
dataset = dataset.add_column("llama_reject_tokens", llama_reject_tokens)
Again, the assertions ensure that the lengths of the tokenized responses are always 2,048 and that the tokenized responses end with either the [PAD] token or the <|eot_id|> token.
Finally, we filter out rows where the chosen and reject responses are identical:
dataset = dataset.filter(lambda row: row['chosen'] != row['reject'])
and split the dataset into a training set and a test set of 1,000 prompts:
dataset = dataset.train_test_split(test_size=1000, shuffle=True)
You can run the complete script with:
python ./src/ultrafeedback_largebatch/filter_tokenize.py --input_repo INPUT_REPO
INPUT_REPO is the saved repo from Part 2 that contains the rewards for each response.
Part 4: Training with REBEL
Finally, we are ready to update the parameters of our model using an RLHF algorithm! We'll now use our curated dataset and the REBEL algorithm to fine-tune our base model.
At each iteration \(t\) of REBEL, we aim to solve the following squared loss regression problem:
$$\theta_{t+1} = \arg\min_{\theta\in\Theta} \sum_{(x, y, y')\in \mathcal{D}_t} \left( \frac{1}{\eta} \left( \ln \frac{\pi_\theta(y|x)}{\pi_{\theta_t}(y|x)} - \ln \frac{\pi_\theta(y'|x)}{\pi_{\theta_t}(y'|x)} \right) - \left( r(x, y) - r(x, y') \right) \right)^2$$
where \(\eta\) is a hyperparameter, \(\theta\) is the parameter of the model, \(x\) is the prompt, \(\mathcal{D}_t\) is the dataset we collected from the previous three parts, \(y\) and \(y'\) are the responses for \(x\), \(\pi_\theta(y|x)\) is the probability of generating response \(y\) given prompt \(x\) under the parameterized policy \(\pi_\theta\), and \(r(x, y)\) is the reward of response \(y\) for prompt \(x\), obtained from Part 2. The detailed derivations of the algorithm are shown in our paper. In short, REBEL lets us avoid the complexity (e.g. clipping, critic models, ...) of other RLHF algorithms like PPO while enjoying stronger theoretical guarantees!
In this tutorial, we demonstrate a single iteration of REBEL (\(t=0\)) using the base model as \(\pi_{\theta_0}\). For multi-iteration training, you can repeat Parts 1 through 4, initializing each iteration with the model trained in the previous iteration.
The complete code for this part is available here. To enable full-parameter training on 8 GPUs, we use the Accelerate library with DeepSpeed Stage 3 by running:
accelerate launch --config_file accelerate_cfgs/deepspeed_config_stage_3.yaml --main-process-port 29080 --num_processes 8 src/ultrafeedback_largebatch/rebel.py --task.input_repo INPUT_REPO --output_dir OUTPUT_DIR
- INPUT_REPO is the saved repo from Part 3 that contains the tokenized prompts and responses.
- OUTPUT_DIR is the directory to save the models.
Step 1: Initialization & Loading
We start by initializing the batch sizes for distributed training (a worked numeric example follows the definitions below):
args.world_size = accelerator.num_processes
args.batch_size = args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps
args.local_batch_size = args.per_device_train_batch_size * args.gradient_accumulation_steps
args.rebel.num_updates = args.total_episodes // args.batch_size
- args.world_size is the number of GPUs we are using.
- args.local_batch_size is the batch size for each GPU.
- args.batch_size is the actual batch size for training.
- args.rebel.num_updates is the total number of updates to perform, and args.total_episodes is the number of data points to train on. Typically, we set args.total_episodes to the size of the training set for one epoch.
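For example, with hypothetical values of 8 GPUs, a per-device batch size of 2, 4 gradient accumulation steps, and 48,000 training examples:

# Hypothetical numbers, just to illustrate the bookkeeping above.
world_size = 8                          # number of GPUs
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
total_episodes = 48000                  # roughly one epoch over the training set

batch_size = world_size * per_device_train_batch_size * gradient_accumulation_steps  # 64
local_batch_size = per_device_train_batch_size * gradient_accumulation_steps         # 8
num_updates = total_episodes // batch_size                                            # 750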
Next, we load the model and tokenizer, making sure dropout layers are disabled so that the log probabilities of the generations are computed without randomness:
tokenizer = AutoTokenizer.from_pretrained(
    args.base_model,
    padding_side="right",
    trust_remote_code=True,
)
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
policy = AutoModelForCausalLM.from_pretrained(
    args.base_model,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
disable_dropout_in_model(policy)
Step 2: Training
Looking again at the REBEL objective, the only quantities we still need in order to train are \(\pi_\theta(y|x)\) and \(\pi_{\theta_0}(y|x)\). We can compute each of them with the code below (a sketch for the frozen reference policy follows the bullet points):
output = policy(
    input_ids=input_ids,
    attention_mask=attention_mask,
    return_dict=True,
    output_hidden_states=True,
)
logits = output.logits[:, args.task.maxlen_prompt - 1 : -1]
logits /= args.task.temperature + 1e-7
all_logprobs = F.log_softmax(logits, dim=-1)
logprobs = torch.gather(all_logprobs, 2, input_ids[:, args.task.maxlen_prompt:].unsqueeze(-1)).squeeze(-1)
logprobs = (logprobs * seq_mask).sum(-1)
- output.logits contains the logits over the entire vocabulary for the sequence of input_ids.
- output.logits[:, args.task.maxlen_prompt - 1 : -1] contains the logits for the response tokens only. It is shifted by 1 since the logits at position \(p\) predict the token at position \(p+1\).
- We divide logits by args.task.temperature to recover the actual distribution used during generation.
- torch.gather is used to gather the log probabilities of the tokens actually generated in the response.
- seq_mask masks out the padding tokens.
Step 3: Loss Computation
Finally, we can compute the loss with:
reg_diff = ((pi_logprobs_y - pi_0_logprobs_y) - (pi_logprobs_y_prime - pi_0_logprobs_y_prime)) / eta - (chosen_reward - reject_reward)
loss = (reg_diff ** 2).mean()
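From here, a typical Accelerate update step looks roughly like the following (a sketch; it assumes an optimizer and the accelerator object were created and prepared earlier in the script, e.g. via accelerator.prepare):

# Sketch of the parameter update (assumes `optimizer` and `accelerator`
# were set up earlier in the training script).
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()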
Performance
With just one iteration of the above four parts, we can greatly improve the performance of the base model on AlpacaEval, MT-Bench, and ArenaHard, three benchmarks commonly used to evaluate the quality, alignment, and helpfulness of responses generated by LLMs.
Takeaway
In this post, we outlined the pipeline for implementing RLHF, covering the entire process from data generation to the actual training phase. While we focused specifically on the REBEL algorithm, this pipeline is versatile and can be readily adapted to other methods such as DPO or SimPO; the required components for these methods are already included, apart from the specific loss formulation (see the sketch below). There is also a natural extension of the above pipeline to multi-turn RLHF, where we optimize for performance over an entire conversation (rather than a single generation). Check out our follow-up paper here for more information!
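For example, reusing the log-probability terms computed in Part 4, a DPO-style loss takes only a couple of lines (a sketch, not the loss used in this post; beta is a hypothetical hyperparameter):

import torch.nn.functional as F

# DPO-style loss on the same chosen/reject log probabilities used by the REBEL loss.
# `beta` is a hypothetical hyperparameter controlling the implicit KL regularization.
dpo_logits = beta * ((pi_logprobs_y - pi_0_logprobs_y) - (pi_logprobs_y_prime - pi_0_logprobs_y_prime))
loss = -F.logsigmoid(dpo_logits).mean()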
If you find this implementation useful, please consider citing our work:
@misc{gao2024rebel,
      title={REBEL: Reinforcement Learning via Regressing Relative Rewards},
      author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2404.16767},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}