Saturday, June 14, 2025
Cyber Defense GO
A Step-by-Step Information to Setting Up a Customized BPE Tokenizer with Tiktoken for Superior NLP Functions in Python

by Md Sazzad Hossain


In this tutorial, we will learn how to create a custom tokenizer using the tiktoken library. The process involves loading a pre-trained tokenizer model, defining both base and special tokens, initializing the tokenizer with a specific regular expression for token splitting, and testing its functionality by encoding and decoding some sample text. This setup is essential for NLP tasks requiring precise control over text tokenization.

from pathlib import Path  # convenient, object-oriented file paths
import tiktoken  # OpenAI's BPE tokenizer library
from tiktoken.load import load_tiktoken_bpe  # loads a BPE rank file
import json  # not used below, but handy for inspecting tokenizer metadata

Here, we import the libraries needed for text processing and tokenizer setup. We use Path from pathlib for easy file path management, while tiktoken and load_tiktoken_bpe let us load and work with a Byte Pair Encoding (BPE) tokenizer.
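Before diving in, it helps to see what the "mergeable ranks" loaded below actually do. The following is a minimal, illustrative pure-Python sketch of the byte-pair merging that tiktoken performs internally; the function name and the toy rank table are hypothetical, not part of the tiktoken API:

```python
def bpe_merge(piece: bytes, ranks: dict) -> list:
    # Start from individual bytes and repeatedly merge the adjacent pair
    # with the lowest (highest-priority) rank, as BPE tokenizers do.
    parts = [bytes([b]) for b in piece]
    while True:
        best = None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best is None or rank < best[1]):
                best = (i, rank)
        if best is None:
            return parts
        i = best[0]
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]

# Toy ranks: the pair "lo" merges first, then "lo"+"w" -> "low".
toy_ranks = {b"lo": 0, b"low": 1}
print(bpe_merge(b"lower", toy_ranks))  # [b'low', b'e', b'r']
```

A real rank file contains tens of thousands of such entries, learned from corpus statistics, so frequent substrings collapse into single tokens.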

tokenizer_path = "./content/tokenizer.model"
num_reserved_special_tokens = 256


mergeable_ranks = load_tiktoken_bpe(tokenizer_path)


num_base_tokens = len(mergeable_ranks)
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|finetune_right_pad_id|>",
    "<|step_id|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|eom_id|>",
    "<|eot_id|>",
    "<|python_tag|>",
]

Here, we set the path to the tokenizer model and specify 256 reserved special tokens. We then load the mergeable ranks, which form the base vocabulary, calculate the number of base tokens, and define a list of special tokens for marking text boundaries and other reserved purposes.
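For intuition, load_tiktoken_bpe returns a dictionary mapping byte sequences to integer merge ranks. A hypothetical miniature stand-in (entries invented purely for illustration; a real file has a vocabulary of tens of thousands) looks like this:

```python
# Hypothetical miniature rank table of the kind load_tiktoken_bpe returns:
# keys are byte sequences, values are merge ranks (lower = merged earlier).
toy_mergeable_ranks = {b"t": 0, b"h": 1, b"e": 2, b"th": 3, b"the": 4}

# The base-vocabulary size is simply the number of entries.
num_base_tokens = len(toy_mergeable_ranks)
print(num_base_tokens)  # 5
```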

reserved_tokens = [
    f"<|reserved_special_token_{2 + i}|>"
    for i in range(num_reserved_special_tokens - len(special_tokens))
]
special_tokens = special_tokens + reserved_tokens


tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},
)

Now, we dynamically create additional reserved tokens until the total reaches 256, then append them to the predefined special-tokens list. We initialize the tokenizer with tiktoken.Encoding, passing a specific regular expression for splitting text, the loaded mergeable ranks as the base vocabulary, and a mapping of special tokens to unique token IDs.
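The bookkeeping above can be sketched in pure Python without the tokenizer file. The base-vocabulary size below is a made-up placeholder; the point is that special-token IDs are appended directly after the base vocabulary:

```python
# Sketch of the reserved-token padding and ID assignment (illustrative sizes).
num_reserved_special_tokens = 256
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
]
reserved = [
    f"<|reserved_special_token_{2 + i}|>"
    for i in range(num_reserved_special_tokens - len(special_tokens))
]
all_special = special_tokens + reserved
assert len(all_special) == 256  # padded up to exactly 256 special tokens

num_base_tokens = 1000  # hypothetical base-vocabulary size
ids = {tok: num_base_tokens + i for i, tok in enumerate(all_special)}
print(ids["<|begin_of_text|>"], ids["<|reserved_special_token_2|>"])  # 1000 1004
```

Appending special tokens after the base vocabulary guarantees their IDs never collide with IDs produced by BPE merges.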

#-------------------------------------------------------------------------
# Test the tokenizer with a sample text
#-------------------------------------------------------------------------
sample_text = "Hello, this is a test of the updated tokenizer!"
encoded = tokenizer.encode(sample_text)
decoded = tokenizer.decode(encoded)


print("Sample Text:", sample_text)
print("Encoded Tokens:", encoded)
print("Decoded Text:", decoded)

We test the tokenizer by encoding a sample text into token IDs and then decoding those IDs back into text. It prints the original text, the encoded tokens, and the decoded text to confirm that the tokenizer works correctly.
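The encode/decode round trip can also be illustrated without the tokenizer model file. With only single-byte entries in a rank table, encoding degenerates to mapping each UTF-8 byte to an ID and decoding inverts that map (a toy sketch, not the real tokenizer):

```python
# Degenerate byte-level encoder/decoder: every rank-table entry is a
# single byte, so no merges ever apply and IDs are just byte values.
ranks = {bytes([i]): i for i in range(256)}
decoder = {v: k for k, v in ranks.items()}

def encode(text: str) -> list:
    return [ranks[bytes([b])] for b in text.encode("utf-8")]

def decode(ids: list) -> str:
    return b"".join(decoder[i] for i in ids).decode("utf-8")

ids = encode("Hi!")
print(ids)          # [72, 105, 33]
print(decode(ids))  # Hi!
```

A trained rank file adds multi-byte merge entries on top of exactly this byte-level base, which is why decode(encode(text)) always round-trips losslessly.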

Here, we encode the string "Hello" into its corresponding token IDs using the tokenizer's encode method.

In conclusion, following this tutorial will teach you how to set up a custom BPE tokenizer using the tiktoken library. You saw how to load a pre-trained tokenizer model, define both base and special tokens, and initialize the tokenizer with a specific regular expression for token splitting. Finally, you verified the tokenizer's functionality by encoding and decoding sample text. This setup is a fundamental step for any NLP project that requires customized text processing and tokenization.


Here is the Colab Notebook for the above project.

