bert-language-model · huggingface-transformers · transformer-model

Must the vocab size match the vocab_size in bert_config.json exactly?


I am looking at someone else's BERT model, in which vocab.txt has 22110 entries, but the vocab_size parameter in bert_config.json is 21128.

I understand that these two numbers must be exactly the same. Is that right?


Solution

  • If it is really BERT, which uses a WordPiece tokenizer, then yes. A mismatch between the vocabulary length and vocab_size in the config would mean that there are either embeddings that can never be used or vocabulary items without any embeddings.

    In this case, you will see no error message, because the model and the tokenizer are loaded separately. With a vocab.txt of 22110 entries and a vocab_size of 21128, the last 982 vocabulary items have no embeddings at all; the model would only fail at runtime if the tokenizer ever produced one of those token ids.

    Note, however, that the model may use some very non-standard tokenizer that saves the vocabulary in a format containing 982 extra items (although this is quite unlikely). A quick way to check for such a mismatch is sketched below.
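Here is a minimal sketch, assuming a Hugging Face transformers setup, of how you could compare the tokenizer vocabulary with the model's embedding table. The checkpoint name bert-base-chinese is just a placeholder for whatever model you are inspecting:

```python
from transformers import BertTokenizer, BertModel

# Placeholder checkpoint name; substitute the model you are actually inspecting.
model_name = "bert-base-chinese"

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

vocab_len = len(tokenizer)                                 # entries in vocab.txt
config_vocab = model.config.vocab_size                     # vocab_size from the config
emb_rows = model.get_input_embeddings().weight.shape[0]    # rows in the embedding table

print(f"tokenizer vocab:   {vocab_len}")
print(f"config vocab_size: {config_vocab}")
print(f"embedding rows:    {emb_rows}")

if vocab_len > emb_rows:
    print(f"{vocab_len - emb_rows} vocabulary items have no embedding "
          "(token ids beyond the table would raise an index error).")
elif vocab_len < emb_rows:
    print(f"{emb_rows - vocab_len} embeddings can never be reached by the tokenizer.")
else:
    print("Tokenizer and embedding table match.")
```

Because the tokenizer and the model are loaded from separate files, this check (or an assertion like it) is the only thing that will surface the discrepancy before inference.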