gensim, doc2vec

Load Doc2Vec without the doc-vectors, only for infer_vector


I have a big gensim Doc2Vec model. I only need to infer vectors; I load the training documents' vectors from another source. Is it possible to load the model as is, without the big .npy file?

Edit: this is what I did:

from gensim.models.doc2vec import Doc2Vec
model_path = r'C:\model/model'
model = Doc2Vec.load(model_path)
model.delete_temporary_training_data(keep_doctags_vectors=False, keep_inference=True)
model.save(model_path)

Then I removed the files (model.trainables.syn1neg.npy, model.wv.vectors.npy) manually and loaded the model again:

model = Doc2Vec.load(model_path)

but the load fails with:

Traceback (most recent call last):

  File "<ipython-input-5-7f868a7dbe0c>", line 1, in <module>
    model = Doc2Vec.load(model_path)

  File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\models\doc2vec.py", line 1113, in load
    return super(Doc2Vec, cls).load(*args, **kwargs)

  File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\models\base_any2vec.py", line 1244, in load
    model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)

  File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\models\base_any2vec.py", line 603, in load
    return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)

  File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\utils.py", line 427, in load
    obj._load_specials(fname, mmap, compress, subname)

  File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\utils.py", line 458, in _load_specials
    getattr(self, attrib)._load_specials(cfname, mmap, compress, subname)

  File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\utils.py", line 469, in _load_specials
    val = np.load(subname(fname, attrib), mmap_mode=mmap)

  File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\numpy\lib\npyio.py", line 428, in load
    fid = open(os_fspath(file), "rb")

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\model/model.trainables.syn1neg.npy'

Note: those files do not exist in the directory. The model runs on a server and downloads the model file from storage. My question is: must the model have those files for inference? I want to run it with as little memory consumption as possible. Thanks.

Edit: Does the file model.trainables.syn1neg.npy hold the model weights? Is the file model.wv.vectors.npy necessary for running inference?


Solution

  • I'm not a fan of the delete_temporary_training_data() method. It implies a clearer separation between training state and the state needed for later uses than really exists. (Inference is very similar to training, though it doesn't need the cached doc-vectors for the training texts.)

    That said, if you've used that method, you shouldn't then be deleting any of the side-files that were still part of the save. If they were written by .save(), they'll be expected, by name, by the .load(). They must be kept with the main model file. (There might be fewer such files, or smaller such files, after the delete_temporary_training_data() call - but any written must be kept for reading.)

    The syn1neg file is absolutely required for inference: it's the model's hidden-to-output weights, needed to perform new forward-predictions (and thus also backpropagated inference-adjustments). The wv.vectors file is definitely needed in default dm=1 mode, where word-vectors are part of the doc-vector calculation. (It might be optional in dm=0 mode, but I'm not sure the code is armored against them being absent - not via in-memory trimming, and definitely not against the expected file being deleted out-of-band.)
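Since every side-file written by .save() has to travel with the main model file, one way to avoid losing any of them when moving a model to a server is to collect them by filename prefix. The following is only a minimal sketch, assuming the side-files are named "<model file>.<suffix>" as in the question (model.trainables.syn1neg.npy, model.wv.vectors.npy). The helpers model_files and copy_model are hypothetical names, not part of gensim.

```python
import glob
import os
import shutil


def model_files(model_path):
    """Return the main model file plus every side-file saved next to it.

    Assumption: side-files share the main file's name as a prefix,
    followed by a dot (e.g. "model" -> "model.wv.vectors.npy").
    """
    # glob.escape protects against special characters in the path itself.
    side_files = sorted(glob.glob(glob.escape(model_path) + ".*"))
    return [model_path] + side_files


def copy_model(model_path, target_dir):
    """Copy the model file and all of its side-files into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    copied = []
    for path in model_files(model_path):
        # shutil.copy2 returns the path of the newly created copy.
        copied.append(shutil.copy2(path, target_dir))
    return copied
```

Calling copy_model(model_path, target_dir) after model.save(model_path) would then move the complete set, so that a later Doc2Vec.load() on the copied main file finds every side-file it expects by name.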