Reinforcement Learning from Human Feedback (RLHF) is a popular approach used to align AI systems with human preferences by training them with feedback from people, rather than relying solely on predefined reward functions. Instead of manually coding every desirable behavior (which is often infeasible in complex tasks), RLHF allows models, especially large language models (LLMs), to learn from examples of what humans consider good or bad outputs. This approach is particularly important for tasks where success is subjective or hard to quantify, such as generating helpful and safe text responses. RLHF has become a cornerstone of building more aligned and controllable AI systems, making it essential for developing AI that behaves the way humans intend.
This blog dives into the full training pipeline of the RLHF framework. We'll explore each stage, from data generation and reward model inference to the final training of an LLM. Our goal is to make everything fully reproducible by providing all the necessary code and the exact specifications of the environments used. By the end of this post, you should know the general pipeline to train any model on any instruction dataset with the RLHF algorithm of your choice!
Preliminary: Setup & Environment
We'll use the following setup for this tutorial:
- Dataset: UltraFeedback, a well-curated dataset consisting of general chat prompts. (While UltraFeedback also contains LLM-generated responses to the prompts, we won't be using those.)
- Base Model: Llama-3-8B-it, a state-of-the-art instruction-tuned LLM. This is the model we will fine-tune.
- Reward Model: Armo, a strong reward model optimized for evaluating generated outputs. We'll use Armo to assign scalar reward values to candidate responses, indicating how "good" or "aligned" a response is.
- Training Algorithm: REBEL, a state-of-the-art algorithm tailored for efficient RLHF optimization.
To get started, clone our repo, which contains all the resources required for this tutorial:
git clone https://github.com/ZhaolinGao/REBEL
cd REBEL
We use two separate environments for different stages of the pipeline:
- vllm: Handles data generation, leveraging the efficient vllm library.
- rebel: Used for training the RLHF model.
You can install both environments using the provided YAML files:
conda env create -f ./envs/rebel_env.yml
conda env create -f ./envs/vllm_env.yml
Part 1: Data Generation
The first step in the RLHF pipeline is generating samples from the policy to receive feedback on. Concretely, in this section, we will load the base model with vllm for fast inference, prepare the dataset, and generate multiple responses for each prompt in the dataset. The complete code for this part is available here.
Activate the vllm environment:
conda activate vllm
First, load the base model and tokenizer using vllm:
from transformers import AutoTokenizer
from vllm import LLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=8,
)
Here, tensor_parallel_size specifies the number of GPUs to use.
Next, load the UltraFeedback dataset:
from datasets import load_dataset

dataset = load_dataset("allenai/ultrafeedback_binarized_cleaned_train", split="train")
You can select a subset of the dataset using dataset.select. For example, to select the first 10,000 rows:
dataset = dataset.select(range(10000))
Alternatively, you can split the dataset into chunks using dataset.shard for implementations like SPPO, where each iteration trains on only one of the chunks (a small example is sketched below).
Now, let's prepare the dataset for generation. The Llama model uses special tokens to distinguish prompts from responses. For example:
<|begin_of_text|><|start_header_id|>user<|end_header_id|> What is France's capital?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Therefore, for every prompt in the dataset, we need to convert it from plain text into this format before generating:
def get_message(instruction):
    message = [
        {"role": "user", "content": instruction},
    ]
    return message

prompts = [tokenizer.apply_chat_template(get_message(row['prompt']), tokenize=False, add_generation_prompt=True) for row in dataset]
- get_message transforms the plain-text prompt into a dictionary indicating it comes from the user.
- tokenizer.apply_chat_template adds the required special tokens and appends the generation prompt (<|start_header_id|>assistant<|end_header_id|>\n\n) at the end when add_generation_prompt=True.
Finally, we can generate the responses using vllm with the prompts we just formatted. We are going to generate 5 responses per prompt:
import torch
import random
import numpy as np
from vllm import SamplingParams

def set_seed(seed=5775709):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for p in range(5):
    set_seed(p * 50)
    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.9,
        max_tokens=2048,
        seed=p * 50,
    )
    response = llm.generate(prompts, sampling_params)
    output = list(map(lambda x: x.outputs[0].text, response))
    dataset = dataset.add_column(f"response_{p}", output)
- temperature=0.8, top_p=0.9 are common settings to control diversity during generation.
- set_seed is used to ensure reproducibility and sets a different seed for each response.
- llm.generate generates the responses, and the results are added to the dataset with dataset.add_column.
You can run the complete script with:
python ./src/ultrafeedback_largebatch/generate.py --world_size NUM_GPU --output_repo OUTPUT_REPO
Part 2: Reward Model Inference
The second step in the RLHF pipeline is querying the reward model to tell us how good a generated sample is. Concretely, in this part, we will calculate reward scores for the responses generated in Part 1, which are later used for training. The complete code for this part is available here.
Activate the rebel environment:
conda activate rebel
To begin, we'll initialize the Armo reward model pipeline. This reward model is a fine-tuned sequence classification model that assigns a scalar reward score to a given dialogue based on its quality.
rm = ArmoRMPipeline("RLHFlow/ArmoRM-Llama3-8B-v0.1", trust_remote_code=True)
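ArmoRMPipeline is defined in the repo's script. As a rough idea of what such a wrapper does, here is a minimal sketch assuming the standard ArmoRM usage with AutoModelForSequenceClassification; the actual class in the repo may differ in its details:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class ArmoRMPipeline:
    # Minimal sketch of a reward-model wrapper; the repo's version may differ.
    def __init__(self, model_id, trust_remote_code=False, device="cuda", max_length=4096):
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_id,
            device_map=device,
            trust_remote_code=trust_remote_code,
            torch_dtype=torch.bfloat16,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.max_length = max_length

    def __call__(self, messages):
        # Tokenize the (user, assistant) dialogue and return a scalar reward.
        input_ids = self.tokenizer.apply_chat_template(
            messages,
            return_tensors="pt",
            truncation=True,
            max_length=self.max_length,
        ).to(self.model.device)
        with torch.no_grad():
            output = self.model(input_ids)
        return output.score.float().item()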
Now, we can gather the reward scores:
def get_message(instruction, response):
    return [{"role": "user", "content": instruction}, {"role": "assistant", "content": response}]

rewards = {}
for i in range(5):
    rewards[f"response_{i}_reward"] = []
    for row in dataset:
        reward = rm(get_message(row['prompt'], row[f'response_{i}']))
        rewards[f"response_{i}_reward"].append(reward)

for k, v in rewards.items():
    dataset = dataset.add_column(k, v)
- get_message formats the user prompt and assistant response into a list of dictionaries.
- rm computes a reward score for each response in the dataset.
You can run the complete script with:
python ./src/ultrafeedback_largebatch/rank.py --input_repo INPUT_REPO
INPUT_REPO is the saved repo from Part 1 that contains the generated responses.
Part 3: Filter and Tokenize
While the previous two parts are all we need in theory to do RLHF, in practice it is often advisable to perform a filtering step so that training runs smoothly. Concretely, in this part, we'll walk through the process of preparing a dataset for training by filtering out excessively long prompts and responses to prevent out-of-memory (OOM) issues, selecting the best and worst responses for training, and removing duplicate responses. The complete code for this part is available here.
Let's first initialize two different tokenizers, one that pads from the right and one that pads from the left:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
tokenizer_left = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", padding_side="left")
tokenizer_left.add_special_tokens({"pad_token": "[PAD]"})
These two tokenizers allow us to pad the prompt from the left and the response from the right so that they meet in the middle. By combining left-padded prompts with right-padded responses, we ensure that:
- Prompts and responses meet at a consistent position.
- Relative position embeddings remain correct for model training.
Here's an example format:
[PAD] ... [PAD] <|begin_of_text|><|start_header_id|>user<|end_header_id|> PROMPT<|eot_id|><|start_header_id|>assistant<|end_header_id|> RESPONSE<|eot_id|>[PAD] ... [PAD]
We want to ensure that the length of
[PAD] ... [PAD] <|begin_of_text|><|start_header_id|>user<|end_header_id|> PROMPT<|eot_id|><|start_header_id|>assistant<|end_header_id|>
is the same for all prompts, and the length of
RESPONSE<|eot_id|>[PAD] ... [PAD]
is the same for all responses.
We filter out prompts longer than 1,024 tokens and responses exceeding 2,048 tokens to prevent OOM during training:
dataset = dataset.filter(lambda row: tokenizer.apply_chat_template(get_message(row['prompt']), tokenize=True, add_generation_prompt=True, return_tensors="pt").shape[-1] <= 1024)
for i in range(5):
    dataset = dataset.filter(lambda row: tokenizer.apply_chat_template(get_message(response=row[f'response_{i}']), tokenize=True, add_generation_prompt=False, return_tensors="pt")[:, 5:].shape[-1] <= 2048)
Note that we skip the first 5 tokens of responses when counting lengths, to exclude the special tokens at the start (e.g. <|begin_of_text|><|start_header_id|>assistant<|end_header_id|>\n\n) and only count the actual length of the response plus the EOS token (<|eot_id|>) at the end.
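As a quick illustration of where that number 5 comes from (assuming the Llama-3 tokenizer loaded above; the exact printed strings may look slightly different):

# The first 5 tokens of an assistant-only chat template are header tokens,
# not part of the response itself, which is why they are sliced off above.
ids = tokenizer.apply_chat_template(
    [{"role": "assistant", "content": "Paris."}],
    tokenize=True,
    add_generation_prompt=False,
)
print(tokenizer.convert_ids_to_tokens(ids[:5]))
# Roughly: ['<|begin_of_text|>', '<|start_header_id|>', 'assistant', '<|end_header_id|>', '\n\n']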
Now we can tokenize the prompts with left padding to a maximum length of 1,024 tokens:
llama_prompt_tokens = []
for row in dataset:
    llama_prompt_token = tokenizer_left.apply_chat_template(
        get_message(row['prompt']),
        add_generation_prompt=True,
        tokenize=True,
        padding='max_length',
        max_length=1024,
    )
    assert len(llama_prompt_token) == 1024
    assert (llama_prompt_token[0] == 128000 or llama_prompt_token[0] == 128256) and llama_prompt_token[-1] == 271
    llama_prompt_tokens.append(llama_prompt_token)
dataset = dataset.add_column("llama_prompt_tokens", llama_prompt_tokens)
The assertions ensure that the length is always 1,024 and that the tokenized prompt starts with either the [PAD] token or the <|begin_of_text|> token and ends with the \n\n token.
Then, we select the responses with the highest and lowest rewards for each prompt as the chosen and reject responses, and tokenize them with right padding:
chosen, reject, llama_chosen_tokens, llama_reject_tokens, chosen_reward, reject_reward = [], [], [], [], [], []

for row in dataset:
    all_rewards = [row[f"response_{i}_reward"] for i in range(5)]
    chosen_idx, reject_idx = np.argmax(all_rewards), np.argmin(all_rewards)

    chosen.append(row[f"response_{chosen_idx}"])
    reject.append(row[f"response_{reject_idx}"])

    llama_chosen_token = tokenizer.apply_chat_template(
        get_message(response=row[f"response_{chosen_idx}"]),
        add_generation_prompt=False,
        tokenize=True,
        padding='max_length',
        max_length=2048+5,
    )[5:]
    llama_chosen_tokens.append(llama_chosen_token)
    chosen_reward.append(row[f"response_{chosen_idx}_reward"])
    assert len(llama_chosen_token) == 2048
    assert llama_chosen_token[-1] == 128009 or llama_chosen_token[-1] == 128256

    llama_reject_token = tokenizer.apply_chat_template(
        get_message(response=row[f"response_{reject_idx}"]),
        add_generation_prompt=False,
        tokenize=True,
        padding='max_length',
        max_length=2048+5,
    )[5:]
    llama_reject_tokens.append(llama_reject_token)
    reject_reward.append(row[f"response_{reject_idx}_reward"])
    assert len(llama_reject_token) == 2048
    assert llama_reject_token[-1] == 128009 or llama_reject_token[-1] == 128256

dataset = dataset.add_column("chosen", chosen)
dataset = dataset.add_column("chosen_reward", chosen_reward)
dataset = dataset.add_column("llama_chosen_tokens", llama_chosen_tokens)
dataset = dataset.add_column("reject", reject)
dataset = dataset.add_column("reject_reward", reject_reward)
dataset = dataset.add_column("llama_reject_tokens", llama_reject_tokens)
Again, the assertions ensure that the lengths of the tokenized responses are always 2,048 and that the tokenized responses end with either the [PAD] token or the <|eot_id|> token.
Finally, we filter out rows where the chosen and reject responses are identical:
dataset = dataset.filter(lambda row: row['chosen'] != row['reject'])
and split the dataset into a training set and a test set of 1,000 prompts:
dataset = dataset.train_test_split(test_size=1000, shuffle=True)
You can run the complete script with:
python ./src/ultrafeedback_largebatch/filter_tokenize.py --input_repo INPUT_REPO
INPUT_REPO is the saved repo from Part 2 that contains the rewards for each response.
Part 4: Training with REBEL
Finally, we are ready to update the parameters of our model using an RLHF algorithm! We'll now use our curated dataset and the REBEL algorithm to fine-tune our base model.
At each iteration \(t\) of REBEL, we aim to solve the following squared loss regression problem:
$$\theta_{t+1} = \arg\min_{\theta\in\Theta} \sum_{(x, y, y')\in \mathcal{D}_t} \left( \frac{1}{\eta} \left( \ln \frac{\pi_\theta(y|x)}{\pi_{\theta_t}(y|x)} - \ln \frac{\pi_\theta(y'|x)}{\pi_{\theta_t}(y'|x)} \right) - \left( r(x, y) - r(x, y') \right) \right)^2$$
where \(\eta\) is a hyperparameter, \(\theta\) is the parameter of the model, \(x\) is the prompt, \(\mathcal{D}_t\) is the dataset we collected from the previous three parts, \(y\) and \(y'\) are the responses for \(x\), \(\pi_\theta(y|x)\) is the probability of generating response \(y\) given prompt \(x\) under the parameterized policy \(\pi_\theta\), and \(r(x, y)\) is the reward of response \(y\) for prompt \(x\), obtained from Part 2. The detailed derivations of the algorithm are shown in our paper. In short, REBEL lets us avoid the complexity (e.g. clipping, critic models, ...) of other RLHF algorithms like PPO while enjoying stronger theoretical guarantees!
In this tutorial, we demonstrate a single iteration of REBEL (\(t=0\)) using the base model as \(\pi_{\theta_0}\). For multi-iteration training, you can repeat Parts 1 through 4, initializing each iteration with the model trained in the previous iteration.
The complete code for this part is available here. To enable full-parameter training on 8 GPUs, we use the Accelerate library with DeepSpeed Stage 3 by running:
accelerate launch --config_file accelerate_cfgs/deepspeed_config_stage_3.yaml --main-process-port 29080 --num_processes 8 src/ultrafeedback_largebatch/rebel.py --task.input_repo INPUT_REPO --output_dir OUTPUT_DIR
- INPUT_REPO is the saved repo from Part 3 that contains the tokenized prompts and responses.
- OUTPUT_DIR is the directory to save the models.
Step 1: Initialization & Loading
We start by initializing the batch sizes for distributed training (a worked numeric example follows the definitions below):
args.world_size = accelerator.num_processes
args.batch_size = args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps
args.local_batch_size = args.per_device_train_batch_size * args.gradient_accumulation_steps
args.rebel.num_updates = args.total_episodes // args.batch_size
- args.world_size is the number of GPUs we are using.
- args.local_batch_size is the batch size for each GPU.
- args.batch_size is the actual batch size for training.
- args.rebel.num_updates is the total number of updates to perform, and args.total_episodes is the number of data points to train on. Typically, we set args.total_episodes to the size of the training set for one epoch.
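For example, with hypothetical values of 8 GPUs, a per-device batch size of 2, 4 gradient accumulation steps, and 48,000 training examples:

# Hypothetical numbers, just to illustrate the bookkeeping above.
world_size = 8                          # number of GPUs
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
total_episodes = 48000                  # roughly one epoch over the training set

batch_size = world_size * per_device_train_batch_size * gradient_accumulation_steps  # 64
local_batch_size = per_device_train_batch_size * gradient_accumulation_steps         # 8
num_updates = total_episodes // batch_size                                            # 750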
Next, we load the model and tokenizer, making sure dropout layers are disabled so that the log probabilities of the generations are computed without randomness:
tokenizer = AutoTokenizer.from_pretrained(
    args.base_model,
    padding_side="right",
    trust_remote_code=True,
)
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
policy = AutoModelForCausalLM.from_pretrained(
    args.base_model,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
disable_dropout_in_model(policy)
Step 2: Training
Looking again at the REBEL objective, the only quantities we still need in order to train are \(\pi_\theta(y|x)\) and \(\pi_{\theta_0}(y|x)\). We can compute each of them with the code below (a sketch for the frozen reference policy follows the bullet points):
output = policy(
    input_ids=input_ids,
    attention_mask=attention_mask,
    return_dict=True,
    output_hidden_states=True,
)
logits = output.logits[:, args.task.maxlen_prompt - 1 : -1]
logits /= args.task.temperature + 1e-7
all_logprobs = F.log_softmax(logits, dim=-1)
logprobs = torch.gather(all_logprobs, 2, input_ids[:, args.task.maxlen_prompt:].unsqueeze(-1)).squeeze(-1)
logprobs = (logprobs * seq_mask).sum(-1)
- output.logits contains the logits over the entire vocabulary for the sequence of input_ids.
- output.logits[:, args.task.maxlen_prompt - 1 : -1] contains the logits for the response tokens only. It is shifted by 1 since the logits at position \(p\) predict the token at position \(p+1\).
- We divide logits by args.task.temperature to recover the actual distribution used during generation.
- torch.gather is used to gather the log probabilities of the tokens actually generated in the response.
- seq_mask masks out the padding tokens.
Step 3: Loss Computation
Finally, we can compute the loss with:
reg_diff = ((pi_logprobs_y - pi_0_logprobs_y) - (pi_logprobs_y_prime - pi_0_logprobs_y_prime)) / eta - (chosen_reward - reject_reward)
loss = (reg_diff ** 2).mean()
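From here, a typical Accelerate update step looks roughly like the following (a sketch; it assumes an optimizer and the accelerator object were created and prepared earlier in the script, e.g. via accelerator.prepare):

# Sketch of the parameter update (assumes `optimizer` and `accelerator`
# were set up earlier in the training script).
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()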
Performance
With just one iteration of the above four parts, we can greatly improve the performance of the base model on AlpacaEval, MT-Bench, and ArenaHard, three benchmarks commonly used to evaluate the quality, alignment, and helpfulness of responses generated by LLMs.
Takeaway
In this post, we outlined the pipeline for implementing RLHF, covering the entire process from data generation to the actual training phase. While we focused specifically on the REBEL algorithm, this pipeline is versatile and can be readily adapted to other methods such as DPO or SimPO; the required components for these methods are already included, apart from the specific loss formulation (see the sketch below). There is also a natural extension of the above pipeline to multi-turn RLHF, where we optimize for performance over an entire conversation (rather than a single generation). Check out our follow-up paper here for more information!
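For example, reusing the log-probability terms computed in Part 4, a DPO-style loss takes only a couple of lines (a sketch, not the loss used in this post; beta is a hypothetical hyperparameter):

import torch.nn.functional as F

# DPO-style loss on the same chosen/reject log probabilities used by the REBEL loss.
# `beta` is a hypothetical hyperparameter controlling the implicit KL regularization.
dpo_logits = beta * ((pi_logprobs_y - pi_0_logprobs_y) - (pi_logprobs_y_prime - pi_0_logprobs_y_prime))
loss = -F.logsigmoid(dpo_logits).mean()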
If you find this implementation useful, please consider citing our work:
@misc{gao2024rebel,
      title={REBEL: Reinforcement Learning via Regressing Relative Rewards},
      author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2404.16767},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}