python · semantics · gensim · word2vec · doc2vec

Gensim Doc2Vec generating huge file for model


I am trying to run the doc2vec library from the gensim package. My problem is that when I train and save the model, the model file is rather large (2.5 GB). I tried using this line:

model.estimate_memory()

But it didn't change anything. I have also tried changing max_vocab_size to decrease the size, but had no luck. Can somebody help me with this?


Solution

  • Doc2Vec models can be large. In particular, any word-vectors in use take 4 bytes per dimension, times two layers of the model. So a 300-dimension model with a 200,000-word vocabulary will use, just for the word-vector arrays themselves:

    200,000 vectors * 300 dimensions * 4 bytes/float * 2 layers = 480MB
    

    (There will be additional overhead for the dictionary storing vocabulary information.)

    Any doc-vectors will also use 4 bytes per dimension. So if you train vectors for a million doc-tags, the model will use, just for the doc-vectors array:

    1,000,000 vectors * 300 dimensions * 4 bytes/float = 1.2GB
    

    (If you're using arbitrary string tags to name the doc-vectors, there'll be additional overhead for that. Both of the size estimates above are also worked out in the short script at the end of this answer.)

    To use less memory when loaded (which will also result in a smaller saved file), you can use a smaller vocabulary, train fewer doc-vecs, or use a smaller vector size (see the constructor sketch at the end of this answer).

    If you'll only need the model for certain narrow purposes, there may be other parts you can throw out after training – but that requires knowledge of the model internals/source-code, and your specific needs, and will result in a model that's broken (and likely to throw errors) for many other usual operations.
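
    To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch in Python. The 200,000-word vocabulary, 1,000,000 doc-tags, and 300 dimensions are just the example figures used above, not values read from any real model.

        # Rough size of the raw vector arrays only; vocabulary/tag dictionaries add more.
        BYTES_PER_FLOAT = 4        # gensim stores vectors as 32-bit floats
        vector_size = 300          # dimensions per vector
        vocab_size = 200_000       # unique words kept in the vocabulary
        num_doc_tags = 1_000_000   # distinct doc-tags trained

        word_vector_bytes = vocab_size * vector_size * BYTES_PER_FLOAT * 2   # two layers
        doc_vector_bytes = num_doc_tags * vector_size * BYTES_PER_FLOAT      # one layer

        print(f"word vectors: ~{word_vector_bytes / 1e9:.2f} GB")  # ~0.48 GB
        print(f"doc vectors:  ~{doc_vector_bytes / 1e9:.2f} GB")   # ~1.20 GB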
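
    If memory (and thus saved-file size) is the constraint, the constructor parameters are the usual levers. The sketch below is illustrative only: the corpus is a stand-in for your own data, and the parameter names (vector_size, min_count, max_vocab_size, epochs) are as in recent gensim releases. Note that estimate_memory(), as used in the question, only reports expected memory use; it does not shrink the model.

        from gensim.models.doc2vec import Doc2Vec, TaggedDocument

        # Stand-in corpus; replace with your own TaggedDocument stream.
        # Plain int tags (0..N-1) avoid the extra overhead of arbitrary string tags.
        corpus = [TaggedDocument(words=["some", "example", "words"], tags=[i])
                  for i in range(1000)]

        model = Doc2Vec(
            vector_size=100,         # smaller vectors -> proportionally smaller arrays
            min_count=10,            # drop rare words to shrink the vocabulary
            max_vocab_size=200_000,  # cap on vocabulary kept during the corpus scan
            epochs=20,
        )
        model.build_vocab(corpus)

        # Returns a dict of expected memory use per component; purely informational.
        print(model.estimate_memory())

        model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
        model.save("small_doc2vec.model")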
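
    As one hedged example of throwing parts out after training: gensim 3.x exposed a Doc2Vec.delete_temporary_training_data() helper (removed in gensim 4.x) that discards training-only arrays. Continuing from the sketch above, after calling it the trained vectors can still be looked up, but the model can no longer be trained further, and may lose other abilities depending on the flags; treat this as an assumption to verify against your gensim version.

        # gensim 3.x only; this method no longer exists in gensim 4.x.
        model.delete_temporary_training_data(
            keep_doctags_vectors=True,  # keep trained doc-vectors for lookup
            keep_inference=True,        # keep the weights infer_vector() needs
        )
        model.save("trimmed_doc2vec.model")  # smaller than the full model on disk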