Saturday, June 14, 2025
Cyber Defense GO
A Step-by-Step Information to Setting Up a Customized BPE Tokenizer with Tiktoken for Superior NLP Functions in Python

by Md Sazzad Hossain


In this tutorial, we will learn how to create a custom tokenizer using the tiktoken library. The process involves loading a pre-trained tokenizer model, defining both base and special tokens, initializing the tokenizer with a specific regular expression for token splitting, and testing its functionality by encoding and decoding some sample text. This setup is essential for NLP tasks requiring precise control over text tokenization.

from pathlib import Path  # convenient, object-oriented file paths
import tiktoken  # OpenAI's BPE tokenizer library
from tiktoken.load import load_tiktoken_bpe  # loads a BPE rank file
import json  # not used below, but handy for inspecting tokenizer metadata

Here, we import the libraries needed for text processing and tokenizer setup. We use Path from pathlib for easy file path management, while tiktoken and load_tiktoken_bpe let us load and work with a Byte Pair Encoding (BPE) tokenizer.
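Before diving in, it helps to see what the "mergeable ranks" loaded below actually do. The following is a minimal, illustrative pure-Python sketch of the byte-pair merging that tiktoken performs internally; the function name and the toy rank table are hypothetical, not part of the tiktoken API:

```python
def bpe_merge(piece: bytes, ranks: dict) -> list:
    # Start from individual bytes and repeatedly merge the adjacent pair
    # with the lowest (highest-priority) rank, as BPE tokenizers do.
    parts = [bytes([b]) for b in piece]
    while True:
        best = None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best is None or rank < best[1]):
                best = (i, rank)
        if best is None:
            return parts
        i = best[0]
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]

# Toy ranks: the pair "lo" merges first, then "lo"+"w" -> "low".
toy_ranks = {b"lo": 0, b"low": 1}
print(bpe_merge(b"lower", toy_ranks))  # [b'low', b'e', b'r']
```

A real rank file contains tens of thousands of such entries, learned from corpus statistics, so frequent substrings collapse into single tokens.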

tokenizer_path = "./content/tokenizer.model"
num_reserved_special_tokens = 256


mergeable_ranks = load_tiktoken_bpe(tokenizer_path)


num_base_tokens = len(mergeable_ranks)
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|finetune_right_pad_id|>",
    "<|step_id|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|eom_id|>",
    "<|eot_id|>",
    "<|python_tag|>",
]

Here, we set the path to the tokenizer model and specify 256 reserved special tokens. We then load the mergeable ranks, which form the base vocabulary, calculate the number of base tokens, and define a list of special tokens for marking text boundaries and other reserved purposes.
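For intuition, load_tiktoken_bpe returns a dictionary mapping byte sequences to integer merge ranks. A hypothetical miniature stand-in (entries invented purely for illustration; a real file has a vocabulary of tens of thousands) looks like this:

```python
# Hypothetical miniature rank table of the kind load_tiktoken_bpe returns:
# keys are byte sequences, values are merge ranks (lower = merged earlier).
toy_mergeable_ranks = {b"t": 0, b"h": 1, b"e": 2, b"th": 3, b"the": 4}

# The base-vocabulary size is simply the number of entries.
num_base_tokens = len(toy_mergeable_ranks)
print(num_base_tokens)  # 5
```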

reserved_tokens = [
    f"<|reserved_special_token_{2 + i}|>"
    for i in range(num_reserved_special_tokens - len(special_tokens))
]
special_tokens = special_tokens + reserved_tokens


tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},
)

Now, we dynamically create additional reserved tokens until the total reaches 256, then append them to the predefined special-tokens list. We initialize the tokenizer with tiktoken.Encoding, passing a specific regular expression for splitting text, the loaded mergeable ranks as the base vocabulary, and a mapping of special tokens to unique token IDs.
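The bookkeeping above can be sketched in pure Python without the tokenizer file. The base-vocabulary size below is a made-up placeholder; the point is that special-token IDs are appended directly after the base vocabulary:

```python
# Sketch of the reserved-token padding and ID assignment (illustrative sizes).
num_reserved_special_tokens = 256
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
]
reserved = [
    f"<|reserved_special_token_{2 + i}|>"
    for i in range(num_reserved_special_tokens - len(special_tokens))
]
all_special = special_tokens + reserved
assert len(all_special) == 256  # padded up to exactly 256 special tokens

num_base_tokens = 1000  # hypothetical base-vocabulary size
ids = {tok: num_base_tokens + i for i, tok in enumerate(all_special)}
print(ids["<|begin_of_text|>"], ids["<|reserved_special_token_2|>"])  # 1000 1004
```

Appending special tokens after the base vocabulary guarantees their IDs never collide with IDs produced by BPE merges.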

#-------------------------------------------------------------------------
# Test the tokenizer with a sample text
#-------------------------------------------------------------------------
sample_text = "Hello, this is a test of the updated tokenizer!"
encoded = tokenizer.encode(sample_text)
decoded = tokenizer.decode(encoded)


print("Sample Text:", sample_text)
print("Encoded Tokens:", encoded)
print("Decoded Text:", decoded)

We test the tokenizer by encoding a sample text into token IDs and then decoding those IDs back into text. It prints the original text, the encoded tokens, and the decoded text to confirm that the tokenizer works correctly.
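The encode/decode round trip can also be illustrated without the tokenizer model file. With only single-byte entries in a rank table, encoding degenerates to mapping each UTF-8 byte to an ID and decoding inverts that map (a toy sketch, not the real tokenizer):

```python
# Degenerate byte-level encoder/decoder: every rank-table entry is a
# single byte, so no merges ever apply and IDs are just byte values.
ranks = {bytes([i]): i for i in range(256)}
decoder = {v: k for k, v in ranks.items()}

def encode(text: str) -> list:
    return [ranks[bytes([b])] for b in text.encode("utf-8")]

def decode(ids: list) -> str:
    return b"".join(decoder[i] for i in ids).decode("utf-8")

ids = encode("Hi!")
print(ids)          # [72, 105, 33]
print(decode(ids))  # Hi!
```

A trained rank file adds multi-byte merge entries on top of exactly this byte-level base, which is why decode(encode(text)) always round-trips losslessly.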

Here, we encode the string "Hello" into its corresponding token IDs using the tokenizer's encode method.

In conclusion, following this tutorial will teach you how to set up a custom BPE tokenizer using the tiktoken library. You saw how to load a pre-trained tokenizer model, define both base and special tokens, and initialize the tokenizer with a specific regular expression for token splitting. Finally, you verified the tokenizer's functionality by encoding and decoding sample text. This setup is a fundamental step for any NLP project that requires customized text processing and tokenization.


Here is the Colab Notebook for the above project.

