I want to be sure I understand correctly:
Does the length of an embedding model mean the number of different tokens it contains?
i.e:
from gensim import downloader
embedding_model = downloader.load('glove-wiki-gigaword-50')
print(len(embedding_model))
output:
400000
means: glove-wiki-gigaword-50 has 400000 different tokens (words), and each token (word) has a size of 50 bytes?
Yes, `len(model)` in this case gives you the count of words inside it. `model.vector_size` will give you the number of dimensions (not bytes) per vector. (The actual size of each vector in bytes will be 4 times the number of dimensions, as each `float32` value takes 4 bytes.)
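To illustrate the arithmetic (a sketch using plain NumPy rather than Gensim itself): a vocabulary of 400,000 words with 50-dimensional `float32` vectors takes 4 bytes per dimension, so each vector is 200 bytes, not 50:

```python
import numpy as np

# Simulated embedding matrix with the same shape as glove-wiki-gigaword-50:
# 400,000 vocabulary entries, 50 dimensions each, stored as float32.
vectors = np.zeros((400_000, 50), dtype=np.float32)

print(vectors.shape[0])   # number of words: 400000
print(vectors.shape[1])   # dimensions per vector: 50
print(vectors[0].nbytes)  # bytes per vector: 50 dims * 4 bytes = 200
print(vectors.nbytes)     # whole matrix: 400000 * 200 = 80000000 bytes
```

So `len(embedding_model)` counts words, while the "50" in the model name is the dimensionality, and the per-vector storage is `vector_size * 4` bytes.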
I generally recommend against ever using the Gensim `api.downloader` functionality: if you instead find & manually download the files from their original source, you'll better understand their contents, formats, & limitations – and where the file has landed in your local filesystem. And by then using a specific class/method to load the file, you'll better understand what kinds of classes/objects you're using, rather than whatever mystery object `downloader.load()` might have given you.
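One reason inspecting the original file helps: the GloVe text format is simply one word per line followed by its space-separated float values. A minimal sketch (the filename and two-word, 3-dimensional vocabulary here are made up for illustration; real GloVe files have one line per word with 50+ values):

```python
import numpy as np

# Write a tiny file in the same plain-text layout as a real GloVe
# download (word, then its vector values, space-separated per line --
# shortened here to 3 dimensions for readability).
with open('tiny-glove.txt', 'w') as f:
    f.write('the 0.1 0.2 0.3\n')
    f.write('cat 0.4 0.5 0.6\n')

# Parse it into a plain dict of word -> float32 vector.
vectors = {}
with open('tiny-glove.txt') as f:
    for line in f:
        word, *values = line.split()
        vectors[word] = np.array(values, dtype=np.float32)

print(len(vectors))          # 2 words in this toy vocabulary
print(vectors['cat'].shape)  # (3,) dimensions per vector
```

If you do want Gensim's `KeyedVectors` interface for a manually downloaded GloVe file, `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)` loads it directly (the `no_header=True` option, in Gensim 4+, accounts for GloVe files lacking the word2vec count/dimensions header line) – and you then know exactly which class you're working with.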