nlp, tokenize, huggingface-transformers, huggingface-tokenizers

What is the difference between len(tokenizer) and tokenizer.vocab_size?


I'm trying to add a few new words to the vocabulary of a pretrained HuggingFace Transformers model. I did the following to change the vocabulary of the tokenizer and also increase the embedding size of the model:

tokenizer.add_tokens(['word1', 'word2', 'word3', 'word4'])
model.resize_token_embeddings(len(tokenizer))
print(len(tokenizer)) # outputs len_vocabulary + 4

But after training the model on my corpus and saving it, I found that the saved tokenizer's vocabulary size hadn't changed. Checking again, I found that the code above does not change the vocabulary size (tokenizer.vocab_size is still the same); only len(tokenizer) has changed.

So my question is: what is the difference between tokenizer.vocab_size and len(tokenizer)?


Solution

  • If you look up vocab_size in the HuggingFace docs, its docstring says that it returns the size of the base vocabulary, excluding the added tokens:

    Size of the base vocabulary (without the added tokens).

    Calling len() on the tokenizer object, on the other hand, invokes its __len__ method:

    def __len__(self):
        """
        Size of the full vocabulary with the added tokens.
        """
        return self.vocab_size + len(self.added_tokens_encoder)
    

    So the former returns the size of the base vocabulary, excluding the added tokens, while the latter includes them: it is essentially vocab_size plus len(self.added_tokens_encoder).
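
    To see the difference in practice, here is a minimal sketch, assuming the bert-base-uncased checkpoint (the exact numbers depend on which model you load):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.vocab_size)  # 30522 (base vocabulary only)
    print(len(tokenizer))        # 30522 (no tokens added yet)

    tokenizer.add_tokens(['word1', 'word2'])
    print(tokenizer.vocab_size)  # still 30522, unchanged
    print(len(tokenizer))        # 30524 = base vocabulary + 2 added tokens

    This is also why model.resize_token_embeddings(len(tokenizer)) uses len(tokenizer) rather than tokenizer.vocab_size: the embedding matrix needs a row for every token the tokenizer can emit, including the added ones.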