Uploading Datasets to Hugging Face: A Step-by-Step Guide



Part 1: Uploading a Dataset to the Hugging Face Hub

Introduction

This part of the tutorial walks you through the process of uploading a custom dataset to the Hugging Face Hub. The Hugging Face Hub is a platform that lets developers share and collaborate on datasets and models for machine learning.

Here, we will take an existing Python instruction-following dataset, transform it into a format suitable for training the latest Large Language Models (LLMs), and then upload it to Hugging Face for public use. We are specifically formatting our data to match the Llama 3.2 chat template, which makes it ready for fine-tuning Llama 3.2 models.

Step 1: Installation and Authentication

First, we need to install the required libraries and authenticate with the Hugging Face Hub:

!pip install -q datasets
!huggingface-cli login

What's happening here:

  • datasets is Hugging Face's library for working with machine learning datasets
  • The quiet flag -q reduces installation output messages
  • huggingface-cli login will prompt you to enter your Hugging Face authentication token
  • You can find your token under your Hugging Face account settings → Access Tokens

After running this cell, you will be prompted to enter your token. This authenticates your session and allows you to push content to the Hub.
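If you prefer to authenticate inside the notebook rather than through the CLI prompt, the huggingface_hub library also provides a login() helper. A minimal sketch, assuming the token is stored in an environment variable named HF_TOKEN (that variable name is an assumption, not part of the original tutorial):

import os
from huggingface_hub import login

# Programmatic alternative to `huggingface-cli login` (sketch):
# reads the token from an environment variable named HF_TOKEN
login(token=os.environ["HF_TOKEN"])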

Step 2: Load the Dataset and Define the Transformation Function

Next, we'll load an existing dataset and define a function to transform it to match the Llama 3.2 chat format:

from datasets import load_dataset


# Load your full custom dataset
dataset = load_dataset('Vezora/Tested-143k-Python-Alpaca')


# Define a function to transform the data
def transform_conversation(example):
    system_prompt = """
    You are an expert Python coding assistant. Your role is to help users write clean, efficient, and bug-free Python code.
    You have been trained on a diverse set of high-quality Python code samples, all of which passed rigorous
    automated testing for functionality and performance.


    Always follow best practices in Python programming, provide concise and readable solutions,
    and make sure that your responses include informative comments when necessary.
    When presented with a coding problem, first create a detailed pseudocode that outlines the
    structure and logic of the solution step by step. Once the pseudocode is complete,
    follow it to generate the actual Python code. This approach will help ensure
    clarity and alignment with the desired logic before writing the code.


    If asked to modify existing code, provide pseudocode highlighting the changes and
    optimizations to be made, focusing on improvements related to performance, error handling,
    and robustness. Remember to explain your thought process and rationale clearly for
    any modifications or code suggestions you provide.
    """
    instruction = example['instruction'].strip()  # Accessing the instruction column
    output = example['output'].strip()            # Accessing the output column


    formatted_text = (
        f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
        {system_prompt}
        <|eot_id|>\n<|start_header_id|>user<|end_header_id|>
        {instruction}
        <|eot_id|><|start_header_id|>assistant<|end_header_id|>
        {output}<|eot_id|>"""
    )
    # instruction = example['instruction'].strip()  # Accessing the instruction column
    # output = example['output'].strip()            # Accessing the output column


    # Apply the new template
    # Since there is no system prompt, we construct the string without the SYS part
    # formatted_text = f'[INST] {instruction} [/INST] {output} '


    return {'text': formatted_text}


What's happening here:

  • We load the 'Vezora/Tested-143k-Python-Alpaca' dataset, which contains Python programming instructions and outputs
  • We define a transformation function that restructures each example into the Llama 3.2 chat format
  • We include a detailed system prompt that gives the model context about its role as a Python coding assistant
  • Special tokens like <|begin_of_text|>, <|start_header_id|>, and <|eot_id|> are Llama 3.2's way of delimiting conversational data
  • This function creates a properly formatted conversation with system, user, and assistant messages

The system prompt is particularly important because it defines the persona and behavior expectations for the model. In this case, we are instructing the model to act as an expert Python coding assistant that follows best practices and provides well-commented, efficient solutions.
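Before transforming all 143,000 examples, it can help to sanity-check the function on a single record. A minimal sketch using the dataset and transform_conversation defined above:

# Apply the transform to one example and preview the formatted conversation
sample = dataset['train'][0]
formatted = transform_conversation(sample)
print(formatted['text'][:500])  # first 500 characters: system prompt, user instruction, assistant output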

Step 3: Apply the Transformation to the Dataset

Now we apply our transformation function to the entire dataset:

# Apply the transformation to the entire dataset
transformed_dataset = dataset['train'].map(transform_conversation)

What's happening here:

  • The map() function applies our transformation function to every example in the dataset
  • This processes all 143,000 examples in the dataset, reformatting them into the Llama 3.2 chat format
  • The result is a new dataset with the same content but structured properly for fine-tuning Llama 3.2

This transformation is essential because it reformats the data into the specific template required by the Llama 3.2 model family. Without this formatting, the model would not recognize the different roles in the conversation (system, user, assistant) or where each message begins and ends.
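If the default single-process map() feels slow on the full dataset, map() also accepts optional arguments for parallelism and column cleanup. A hedged variation (num_proc=4 is an arbitrary choice, and dropping the original columns is optional):

# Optional variation: parallelize the transformation and keep only the new 'text' column
transformed_dataset = dataset['train'].map(
    transform_conversation,
    num_proc=4,                                    # number of worker processes (arbitrary)
    remove_columns=dataset['train'].column_names,  # drop 'instruction', 'output', etc.
)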

Step 4: Upload the Dataset to the Hugging Face Hub

With our dataset prepared, we can now upload it to the Hugging Face Hub:

transformed_dataset.push_to_hub("Llama-3.2-Python-Alpaca-143k")

What's happening here:

  • The push_to_hub() method uploads our transformed dataset to the Hugging Face Hub
  • "Llama-3.2-Python-Alpaca-143k" will be the name of your dataset repository
  • This creates a new repository under your username: https://huggingface.co/datasets/YOUR_USERNAME/Llama-3.2-Python-Alpaca-143k
  • The dataset will now be publicly available for others to download and use

After running this cell, you'll see progress bars indicating the upload status. Once complete, you can visit the Hugging Face Hub to view your newly uploaded dataset, edit its description, and share it with the community.
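If you would rather not publish the dataset immediately, push_to_hub() also accepts a private flag, and reloading the repository is an easy way to confirm the upload worked. A sketch (replace YOUR_USERNAME with your own account name):

# Push to a private repository instead of a public one
transformed_dataset.push_to_hub("Llama-3.2-Python-Alpaca-143k", private=True)

# Verify the upload by loading it back from the Hub
from datasets import load_dataset
check = load_dataset("YOUR_USERNAME/Llama-3.2-Python-Alpaca-143k", split="train")
print(check[0]['text'][:200])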

This dataset is now ready to be used for fine-tuning Llama 3.2 models on Python programming tasks, with properly formatted conversations that include system instructions, user queries, and assistant responses!

Part 2: Fine-tuning and Uploading a Model to the Hugging Face Hub

Now that we've prepared and uploaded our dataset, let's move on to fine-tuning a model and uploading it to the Hugging Face Hub.

Step 1: Install Required Libraries

First, we need to install all the libraries required for fine-tuning large language models efficiently:

!pip set up "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip set up "git+https://github.com/huggingface/transformers.git"
!pip set up -U trl
!pip set up --no-deps trl peft speed up bitsandbytes
!pip set up torch torchvision torchaudio triton
!pip set up xformers
!python -m xformers.data
!python -m bitsandbytes

What this does: Installs Unsloth (a library for faster LLM fine-tuning), the latest version of Transformers, TRL (for supervised and reinforcement-learning-based fine-tuning), PEFT (for parameter-efficient fine-tuning), and other training dependencies. The xformers and bitsandbytes libraries help with memory efficiency.
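Because the 4-bit loading and memory-saving libraries below assume a CUDA GPU (the default on a Colab GPU runtime), a quick check before continuing can save a confusing error later. A minimal sketch:

import torch

# Confirm that a CUDA device is visible before loading the quantized model
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU detected - switch the Colab runtime to a GPU before proceeding")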

Step 2: Load the Dataset

Next, we load the dataset we prepared in the previous part:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import torch
from datasets import load_dataset
max_seq_length = 2048
dataset = load_dataset("nikhiljatiwal/Llama-3.2-Python-Alpaca-143k", split="train")

What this does: Sets the maximum sequence length for our model and loads our previously uploaded Python coding dataset from Hugging Face.
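A quick inspection confirms that the split loaded correctly and contains the 'text' field created in Part 1. A minimal sketch:

# Quick look at the loaded split
print(len(dataset))              # number of training examples (~143k)
print(dataset.column_names)      # should include 'text'
print(dataset[0]['text'][:300])  # preview one formatted conversation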

Step 3: Load the Pre-trained Model

Now we load a quantized version of Llama 3.2:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True
)

What this does: Loads a 4-bit quantized version of the Llama 3.2 3B Instruct model from Unsloth's repository. Quantization reduces the memory footprint while preserving most of the model's performance.
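To see the effect of quantization, you can check roughly how much GPU memory the model occupies right after loading. A rough sketch (numbers will vary by GPU and runtime):

import torch

# Approximate GPU memory held after loading the 4-bit model
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")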

Step 4: Configure PEFT (Parameter-Efficient Fine-Tuning)

We'll set up the model for efficient fine-tuning using LoRA (Low-Rank Adaptation):

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    max_seq_length = max_seq_length
)

What this does: Configures the model for Parameter-Efficient Fine-Tuning with LoRA. This approach trains only a small number of new parameters while keeping most of the original model frozen, allowing efficient training with limited resources. We are targeting specific projection layers in the model with a rank of 16.
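You can verify how few parameters LoRA actually trains by counting the parameters that require gradients once get_peft_model() returns. A minimal sketch:

# Count trainable vs. total parameters after attaching the LoRA adapters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")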

Step 5: Mount Google Drive for Saving

To make sure our trained model is preserved even if the session disconnects:

from google.colab import drive
drive.mount("/content/drive")

What this does: Mounts your Google Drive so checkpoints and the final model can be saved there.

Step 6: Set Up Training and Start Training

Now we configure and start the training process:

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "/content/drive/My Drive/Llama-3.2-3B-Instruct-bnb-4bit"
    ),
)


trainer.train()

What this does: Creates a Supervised Fine-Tuning Trainer with our model, dataset, and training parameters. Training runs for 60 steps with a per-device batch size of 2, gradient accumulation of 4, and a learning rate of 2e-4. Model checkpoints are written to Google Drive.
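Because the checkpoints land in Google Drive, a disconnected Colab session does not have to mean starting over. Assuming the output_dir above already contains a saved checkpoint, training can be resumed with the standard Trainer option:

# Resume from the most recent checkpoint in output_dir (only works if one exists)
trainer.train(resume_from_checkpoint=True)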

Step 7: Save the Fine-tuned Model Locally

After training, we save our model:

model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

What this does: Saves the fine-tuned LoRA adapter and tokenizer to a local directory.
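Before uploading, a quick smoke test shows whether the adapter actually behaves like a Python assistant. A sketch using Unsloth's inference mode and the same Llama 3.2 chat template from Part 1 (the example prompt is made up for illustration):

# Switch the model into Unsloth's faster inference mode and generate one reply
FastLanguageModel.for_inference(model)

prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
    "Write a Python function that reverses a string.\n"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))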

Step 8: Upload the Model to the Hugging Face Hub

Finally, we upload our fine-tuned model to Hugging Face:

import os
from google.colab import userdata


HF_TOKEN = userdata.get('HF_WRITE_API_KEY')


model.push_to_hub_merged("nikhiljatiwal/Llama-3.2-3B-Instruct-code-bnb-4bit", tokenizer, save_method = "merged_16bit", token=HF_TOKEN)
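What this does: Reads your write-access token from Colab's userdata secrets and uses Unsloth's push_to_hub_merged() to merge the LoRA adapters into the base weights and upload the merged 16-bit model together with the tokenizer to your Hugging Face account. If you only want to share the lightweight adapter rather than the full merged model, a hedged alternative is the standard PEFT-style push (the repository name below is illustrative, not from the original tutorial):

# Alternative (sketch): upload only the LoRA adapter weights saved in Step 7
model.push_to_hub("nikhiljatiwal/Llama-3.2-3B-Instruct-code-lora", token=HF_TOKEN)
tokenizer.push_to_hub("nikhiljatiwal/Llama-3.2-3B-Instruct-code-lora", token=HF_TOKEN)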

In this guide, we demonstrated a complete workflow for AI model customization using Hugging Face. We transformed a Python instruction dataset into the Llama 3.2 format with a specialized system prompt and uploaded it as "Llama-3.2-Python-Alpaca-143k". We then fine-tuned a Llama 3.2 model using efficient techniques (4-bit quantization and LoRA) with minimal computing resources. Finally, we shared both resources on the Hugging Face Hub, making our Python coding assistant available to the community. This project shows how accessible AI development has become, enabling developers to create specialized models for specific tasks with relatively modest resources.


The accompanying Colab notebooks are Notebook_Llama_3_2_3B_Instruct_code and Notebook_Llama_3_2_Python_Alpaca_143k.



Nikhil is a consulting intern at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Materials Science, he is exploring new advancements and creating opportunities to contribute.
