Tags: tokenize, bert-language-model

Why was BERT's default vocabulary size set to 30522?


I have been trying to build a BERT model for a specific domain. However, my model is trained on non-English text, so I'm worried that the default token size, 30522, won't fit my model.

Does anyone know where the number 30522 came from?

I assume the researchers chose this number as a trade-off between training time and vocabulary coverage, but a clearer explanation would be appreciated.


Solution

  • The number 30522 is not a "token size." It is the size of the WordPiece vocabulary BERT was trained with. See this link for an explanation of WordPiece. The number 30522 likely means the base character set contained 522 characters and the WordPiece algorithm then ran for 30,000 merge iterations on top of it. A sketch of inspecting this size and training your own vocabulary follows below.
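
If it helps, here is a minimal sketch (using the Hugging Face transformers and tokenizers libraries, which the question does not mention) showing that 30522 is simply the vocabulary size of the released checkpoint, and how you could train a WordPiece vocabulary of a different size on your own corpus. The file name my_corpus.txt and the vocab_size of 32000 are placeholder choices, not recommendations.

```python
# Minimal sketch: inspect BERT's vocabulary size and train a custom
# WordPiece vocabulary for a domain-specific, non-English corpus.
# "my_corpus.txt" and vocab_size=32000 are placeholders.

from transformers import BertTokenizer
from tokenizers import BertWordPieceTokenizer

# 1. Confirm that 30522 is just the vocabulary size of the released model.
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.vocab_size)  # -> 30522

# 2. Train your own WordPiece vocabulary with whatever size suits your corpus.
wp = BertWordPieceTokenizer(lowercase=True)
wp.train(
    files=["my_corpus.txt"],  # your domain-specific text, one example per line
    vocab_size=32000,         # choose based on coverage vs. embedding-table size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
wp.save_model("my_tokenizer")  # writes a vocab.txt usable with BertTokenizer
```

Whatever vocab_size you pick must then match the vocabulary size you configure for the BERT model itself, since it determines the shape of the token embedding table.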