I'm trying to train a tokenizer on the HuggingFace wiki_split dataset. According to the Tokenizers documentation on GitHub, I can train a tokenizer with the following code:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE())
# You can customize how pre-tokenization (e.g., splitting into words) is done:
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
# Then training your tokenizer on a set of files just takes two lines of codes:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
# Once your tokenizer is trained, encode any text with just one line:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
However, that example loads from three files: wiki.train.raw, wiki.valid.raw and wiki.test.raw. In my case, I am loading from the wiki_split dataset. My code is as follows:
from tokenizers.trainers import BpeTrainer
def iterator_wiki(dataset):
    for txt in dataset:
        if type(txt) != float:
            yield txt
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(iterator_wiki(wiki_train), trainer=trainer)
tokenizer.train_from_iterator() only accepts a single iterator, and I am passing it just one dataset split. How can I use the validation and test splits here as well?
Use a single iterator that walks over all three datasets one after another.
Also note that each element in the wiki_split dataset is a dictionary, so the text has to be pulled out of a field before it is yielded. The first element of the train dataset is shown below:
{'complex_sentence': "'' New Day '' is a song by American hip hop recording artist 50 Cent , released on July 27 , 2012 , as an promotional single from his upcoming fifth studio album '' Street King Immortal '' ( 2013 ) .",
'simple_sentence_1': "'' New Day '' is a song by American hip hop recording artist 50 Cent . ",
'simple_sentence_2': " The song was released on July 27 , 2012 , as a single from his upcoming fifth studio album '' Street King Immortal '' ( 2013 ) ."}
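As a rough illustration of the idea above, the sketch below chains several splits with itertools.chain and yields every string field of each record rather than only complex_sentence. The iter_text helper and the tiny in-memory train_split / val_split lists are hypothetical stand-ins for the real splits; the sketch only assumes that each record is a dictionary with the keys shown above.

```python
from itertools import chain

# Column names taken from the wiki_split record shown above.
TEXT_COLUMNS = ("complex_sentence", "simple_sentence_1", "simple_sentence_2")


def iter_text(*splits, columns=TEXT_COLUMNS):
    """Yield every non-empty string found in `columns`, one split after another."""
    for record in chain(*splits):
        for column in columns:
            text = record.get(column)
            # Skip missing or non-string values (e.g. None or NaN floats).
            if isinstance(text, str) and text.strip():
                yield text


# Hypothetical stand-in data; the real splits would come from load_dataset.
train_split = [
    {"complex_sentence": "A and B .", "simple_sentence_1": "A . ", "simple_sentence_2": " B ."},
]
val_split = [
    {"complex_sentence": "C and D .", "simple_sentence_1": None, "simple_sentence_2": " D ."},
]

texts = list(iter_text(train_split, val_split))
print(texts)
```

Whether the simple sentences should be included is a judgment call: they largely repeat the words of the complex sentence, so they mostly change token frequencies rather than add new vocabulary.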
Working Example
# Load the datasets
from datasets import load_dataset
train_dataset = load_dataset('wiki_split', split='train')
test_dataset = load_dataset('wiki_split', split='test')
val_dataset = load_dataset('wiki_split', split='validation')
# Iterator that yields the text from complex_sentence, one split after another
def iterator_wiki(train_dataset, test_dataset, val_dataset):
    for mydataset in [train_dataset, test_dataset, val_dataset]:
        for data in mydataset:
            # Skip records whose complex_sentence is missing or not a string
            if isinstance(data.get("complex_sentence"), str):
                yield data["complex_sentence"]
# Build the tokenizer and train it on all three splits
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(
    iterator_wiki(train_dataset, test_dataset, val_dataset), trainer=trainer)
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
Output:
['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '😁', '?']
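train_from_iterator also accepts an iterator that yields lists of strings, so the same idea can be written to hand over the text in batches. The sketch below is one possible way to do that; batch_iterator, its batch_size default and the toy_splits data are hypothetical and only stand in for the three real splits so the batching logic can be checked without downloading anything.

```python
def batch_iterator(splits, column="complex_sentence", batch_size=1000):
    """Yield lists of up to `batch_size` strings, drawing from each split in turn."""
    batch = []
    for split in splits:
        for record in split:
            text = record.get(column)
            if isinstance(text, str):
                batch.append(text)
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    # Emit whatever is left over after the last split.
    if batch:
        yield batch


# Hypothetical stand-in for [train_dataset, test_dataset, val_dataset].
toy_splits = [
    [{"complex_sentence": "one"}, {"complex_sentence": "two"}],
    [{"complex_sentence": "three"}, {"complex_sentence": None}],
    [{"complex_sentence": "four"}, {"complex_sentence": "five"}],
]

batches = list(batch_iterator(toy_splits, batch_size=2))
print(batches)

# With the objects from the working example, the call would look roughly like:
# tokenizer.train_from_iterator(
#     batch_iterator([train_dataset, test_dataset, val_dataset]), trainer=trainer)
```

Batches are allowed to span split boundaries here, which is harmless for training because the tokenizer only sees a stream of text.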