Tags: python, python-3.x, nlp, huggingface-transformers, huggingface-tokenizers

Tokenizers change vocabulary entry


I have some text which I want to perform NLP on. To do so, I download a pre-trained tokenizer like so:

import transformers as ts

pr_tokenizer = ts.AutoTokenizer.from_pretrained('distilbert-base-uncased', cache_dir='tmp')

Then I create my own tokenizer with my data like this:

from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

tokenizer.train(['transcripts.raw'], trainer)

Now comes the part where I get confused... I need to update the entries in the pretrained tokenizer (pr_tokenizer) whose keys are the same as in my tokenizer (tokenizer). I have tried several methods; here is one of them:

new_vocab = pr_tokenizer.vocab
v = tokenizer.get_vocab()

for i in v:
    if i in new_vocab:
        new_vocab[i] = v[i]
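
A toy example of what that loop is meant to do (made-up tokens and ids, purely to illustrate the merge):

```python
# Pretrained vocab and a custom-trained vocab (hypothetical tokens/ids)
pretrained = {"[UNK]": 0, "hello": 1, "world": 2}
custom = {"hello": 7, "brandnew": 3}

merged = dict(pretrained)
for token, idx in custom.items():
    if token in merged:  # only overwrite tokens both vocabs share
        merged[token] = idx

print(merged)  # {'[UNK]': 0, 'hello': 7, 'world': 2}
```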

So what do I do now? I was thinking something like:

pr_tokenizer.vocab.update(new_vocab)

or

pr_tokenizer.vocab = new_vocab

Neither works. Does anyone know a good way of doing this?


Solution

  • You can download the tokenizer files from GitHub or the HuggingFace Hub into the same folder as your code, and then edit the vocabulary file before the tokenizer is loaded:

    new_vocab = {}
    
    # Getting the pretrained vocabulary entries: one token per line,
    # the line number is the token id
    for i, row in enumerate(open('./distilbert-base-uncased/vocab.txt', 'r')):
        new_vocab[row.rstrip('\n')] = i
    
    # your vocabulary entries
    v = tokenizer.get_vocab()
    
    # replace common (your code)
    for i in v:
        if i in new_vocab:
            new_vocab[i] = v[i]
    
    # overwrite vocab.txt so that from_pretrained picks up the merged vocabulary
    with open('./distilbert-base-uncased/vocab.txt', 'w') as f:
        # invert the mapping: id -> token
        rev_vocab = {idx: tok for tok, idx in new_vocab.items()}
        # write tokens back to the file in id order
        for i in sorted(rev_vocab):
            f.write(rev_vocab[i] + '\n')
    
    # loading the new tokenizer
    pr_tokenizer = ts.AutoTokenizer.from_pretrained('./distilbert-base-uncased')
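
The id-to-token inversion above can be checked in isolation (toy vocabulary, hypothetical tokens; note that id 2 is deliberately absent to show the gap being skipped):

```python
# Invert a token -> id mapping and recover the tokens in id order,
# skipping any ids missing from the mapping
new_vocab = {"[UNK]": 0, "world": 1, "hello": 3}
rev_vocab = {idx: tok for tok, idx in new_vocab.items()}
lines = [rev_vocab[i] for i in sorted(rev_vocab)]
print(lines)  # ['[UNK]', 'world', 'hello']
```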