
What do tokens and vocab mean in GloVe embeddings?


I am using GloVe embeddings, and I am quite confused about what tokens and vocab mean in the embedding descriptions. For example, this one:

Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)

What do tokens and vocab mean, respectively? What is the difference?


Solution

  • In NLP, tokens refers to the total number of "words" in your corpus. I put words in quotes because the definition varies by task. The vocab is the number of unique "words". So for the Common Crawl vectors above, the corpus contained 840 billion tokens in total, of which 2.2 million were distinct.

    It should always be the case that vocab <= tokens, as the short sketch below illustrates.
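
To make the distinction concrete, here is a minimal Python sketch on a hypothetical toy corpus. It uses naive whitespace tokenization, which is an assumption for illustration only; the GloVe corpora were tokenized with more sophisticated rules.

```python
# Hypothetical toy corpus (not from the GloVe data), just to illustrate the terms.
corpus = "the cat sat on the mat the cat slept"

# Naive whitespace tokenization: every occurrence of a word counts as a token.
tokens = corpus.split()

# The vocab is the set of unique tokens.
vocab = set(tokens)

print(len(tokens))  # 9  -> total tokens in the corpus
print(len(vocab))   # 6  -> vocabulary size ("the" and "cat" are counted once each)
```

Repeated words inflate the token count but not the vocab, which is why a corpus with 840 billion tokens can have a vocabulary of only 2.2 million.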