While training Doc2Vec (d2v) on a large text corpus, I got these 3 files:
doc2vec.model.trainables.syn1neg.npy
doc2vec.model.vocabulary.cum_table.npy
doc2vec.model.wv.vectors.npy
But the final model was not saved, because there was not enough free space available on the disk:
OSError: 5516903000 requested and 4427726816 written
Is there a way to re-save my model using these files in less time than the full training took?
Thank you in advance!
If you still have the model in RAM, in an environment (like a Jupyter notebook) where you can run new code, you could clear space (or attach a new volume) and then try a .save() again. That is, you don't need to re-train, just re-save what's already in RAM.
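A minimal sketch of that, assuming the trained model is still bound to a variable (here hypothetically named model) in your session:

```python
# Hypothetical: `model` is the already-trained Doc2Vec object still in RAM.
# After freeing disk space (or mounting a larger volume), just save again --
# no re-training is needed.
model.save('/mnt/bigger_volume/doc2vec.model')
```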
There's no routine for saving "just what isn't already saved". So even though the subfiles that did save could potentially be valuable if you were desperate to salvage anything from the first training run (perhaps via a process like in my Word2Vec answer here, though it's a bit more complicated with Doc2Vec), trying another save to the same place/volume would require getting those existing files out of the way. (Maybe you could transfer them to remote storage in case they'll be needed, but delete them locally to free space?)
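As an aside, the subfiles that did get written are ordinary NumPy arrays, so before moving or deleting them you could at least confirm what they contain; a sketch, using one of the filenames from your listing:

```python
import numpy as np

# Each *.npy subfile written by gensim's save() is a plain NumPy array;
# loading it shows the shape/dtype, which helps decide whether it's worth
# keeping for a later salvage attempt.
word_vectors = np.load('doc2vec.model.wv.vectors.npy')
print(word_vectors.shape, word_vectors.dtype)
```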
If you try to save to a filename that ends ".gz", gensim will try to save everything compressed, which might help a little. (Unfortunately, the main vector arrays don't compress very well, so this might not be enough savings alone.)
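For instance, a sketch of that compressed re-save (the target filename is just illustrative):

```python
# A filename ending in '.gz' tells gensim to compress what it writes.
# The large arrays of raw float vectors compress poorly, so expect only
# modest savings.
model.save('doc2vec.model.gz')
```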
There's no easy way to slim an already-trained model in memory, without potentially destroying some of its capabilities. (There are hard ways, but only if you're sure you can discard things a full model could do... and it's not yet clear you're in that situation.)
The major contributors to model size are the number of unique words and the number of unique doc-tags.
Specifying a larger min_count before training will discard more low-frequency words – and very-low-frequency words often just hurt the model anyway, so this trimming often improves three things simultaneously: faster training, a smaller model, and higher-quality results on downstream tasks.
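As an illustration only (the parameter values and the corpus variable are placeholders, not recommendations):

```python
from gensim.models.doc2vec import Doc2Vec

# `corpus` is assumed to be an iterable of TaggedDocument objects.
# Raising min_count (default 5) discards rarer words during build_vocab,
# shrinking the vocabulary and therefore the model's memory/disk footprint.
model = Doc2Vec(vector_size=100, min_count=20, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```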
If you're using plain-int doc-tags, the model will require vector space for all doc-tag ints from 0 to your highest number. So even if you trained just 2 documents, if they had plain-int doc-tags of 999998 and 999999, it'd still need to allocate (and save) garbage vectors for 1 million tags, 0 to 999,999. So in some cases people's memory/disk usage is higher than expected because of that – and either using contiguous IDs starting from 0, or switching to string-based doc-tags, reduces size a lot. (But, again, this has to be chosen before training.)
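A small sketch contrasting the tagging styles (toy documents only):

```python
from gensim.models.doc2vec import TaggedDocument

# Plain-int tags: the model allocates (and saves) a vector for every int
# from 0 up to the highest tag -- here, about a million mostly-unused slots.
wasteful = [
    TaggedDocument(words=['some', 'text'], tags=[999998]),
    TaggedDocument(words=['more', 'text'], tags=[999999]),
]

# Contiguous ints from 0 allocate only as many doc-vectors as documents...
compact_ints = [
    TaggedDocument(words=['some', 'text'], tags=[0]),
    TaggedDocument(words=['more', 'text'], tags=[1]),
]

# ...as do string tags, which are mapped to a compact internal index.
compact_strs = [
    TaggedDocument(words=['some', 'text'], tags=['doc_a']),
    TaggedDocument(words=['more', 'text'], tags=['doc_b']),
]
```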