Tags: gensim, word2vec, word-embedding

What is the meaning of size(embedding_model)?


I want to be sure I understand correctly:

Does the length of an embedding model mean the number of different tokens it contains?

i.e:

from gensim import downloader

# Downloads the pretrained GloVe vectors on first use, then loads them
embedding_model = downloader.load('glove-wiki-gigaword-50')
print(len(embedding_model))

output:

400000 

means: glove-wiki-gigaword-50 has 400000 different tokens (words), and each token (word) has a size of 50 bytes?


Solution

  • Yes, len(model) in this case gives you the count of words inside it.

    model.vector_size will give you the number of dimensions (not bytes) per vector. (The actual size of the vector in bytes will be 4 times the count of dimensions, as each float32-sized value takes 4 bytes.)

    I generally recommend against ever using the Gensim api.downloader functionality: if you instead find & manually download from the original source of the files, you'll better understand their contents, formats, & limitations – and where the file has landed in your local filesystem. And, by then using a specific class/method to load the file, you'll better understand what kinds of classes/objects you're using, rather than whatever mystery-object downloader.load() might have given you.