Tags: python, pytorch, nlp, huggingface-transformers

What does the vocabulary of a pre-trained / fine-tuned T5 model look like?


My question concerns the pre-trained T5 models found on Huggingface. Whether I take the fully pre-trained model as-is or fine-tune it first, is there an API function for directly downloading its vocabulary?

More specifically, the default vocab_size for T5 is 32128 (from the documentation). Does that mean that after the model is trained, its decoder can generate up to 32128 unique words?

As an aside, I have noticed that capitalization does sometimes appear in my fine-tuned T5's output. Does that mean the 32128-entry vocabulary can also contain capitalized variants of words, e.g., is there one vocab index for "hello" and another index for "Hello"?


Solution

    • The default T5 vocabulary consists of 32,128 subword tokens (produced by a SentencePiece tokenizer), not whole words. Because the decoder builds words by concatenating subword pieces, it can generate far more unique words than the 32,128 vocabulary entries (see the sketch after this list).

    • "hello" and "Hello" are treated as different tokens because T5's tokenizer is case-sensitive.