Tags: python, python-3.x, nlp, huggingface-transformers, huggingface-tokenizers

Tokenizers change vocabulary entry


I have some text which I want to perform NLP on. To do so, I download a pre-trained tokenizer like so:

import transformers as ts

pr_tokenizer = ts.AutoTokenizer.from_pretrained('distilbert-base-uncased', cache_dir='tmp')

Then I create my own tokenizer with my data like this:

from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

tokenizer.train(['transcripts.raw'], trainer)

Now comes the part where I get confused... I need to update the entries in the pretrained tokenizer (pr_tokenizer) whose keys are the same as in my tokenizer (tokenizer). I have tried several methods; here is one of them:

new_vocab = pr_tokenizer.vocab
v = tokenizer.get_vocab()

for i in v:
    if i in new_vocab:
        new_vocab[i] = v[i]
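
A toy example of what that loop is meant to do (made-up tokens and ids, purely to illustrate the merge):

```python
# Pretrained vocab and a custom-trained vocab (hypothetical tokens/ids)
pretrained = {"[UNK]": 0, "hello": 1, "world": 2}
custom = {"hello": 7, "brandnew": 3}

merged = dict(pretrained)
for token, idx in custom.items():
    if token in merged:  # only overwrite tokens both vocabs share
        merged[token] = idx

print(merged)  # {'[UNK]': 0, 'hello': 7, 'world': 2}
```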

So what do I do now? I was thinking something like:

pr_tokenizer.vocab.update(new_vocab)

or

pr_tokenizer.vocab = new_vocab

Neither works. Does anyone know a good way of doing this?


Solution

  • You can download the tokenizer files from GitHub or the HuggingFace Hub into the same folder as your code, and then edit the vocabulary file before the tokenizer is loaded:

    new_vocab = {}
    
    # Getting the pretrained vocabulary entries: one token per line,
    # the line number is the token id
    for i, row in enumerate(open('./distilbert-base-uncased/vocab.txt', 'r')):
        new_vocab[row.rstrip('\n')] = i
    
    # your vocabulary entries
    v = tokenizer.get_vocab()
    
    # replace common (your code)
    for i in v:
        if i in new_vocab:
            new_vocab[i] = v[i]
    
    # overwrite vocab.txt so that from_pretrained picks up the merged vocabulary
    with open('./distilbert-base-uncased/vocab.txt', 'w') as f:
        # invert the mapping: id -> token
        rev_vocab = {idx: tok for tok, idx in new_vocab.items()}
        # write tokens back to the file in id order
        for i in sorted(rev_vocab):
            f.write(rev_vocab[i] + '\n')
    
    # loading the new tokenizer
    pr_tokenizer = ts.AutoTokenizer.from_pretrained('./distilbert-base-uncased')
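
The id-to-token inversion above can be checked in isolation (toy vocabulary, hypothetical tokens; note that id 2 is deliberately absent to show the gap being skipped):

```python
# Invert a token -> id mapping and recover the tokens in id order,
# skipping any ids missing from the mapping
new_vocab = {"[UNK]": 0, "world": 1, "hello": 3}
rev_vocab = {idx: tok for tok, idx in new_vocab.items()}
lines = [rev_vocab[i] for i in sorted(rev_vocab)]
print(lines)  # ['[UNK]', 'world', 'hello']
```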