Importing a gensim doc2vec model in deeplearning4j


I have trained a doc2vec model with gensim and would like to import it into Deeplearning4j in order to deploy it.

For word2vec models, I know that this is possible by saving the model with

model.wv.save_word2vec_format("word2vec.bin", binary=True)

and importing it in Java with

Word2Vec w2vModel = WordVectorSerializer.readWord2VecModel("word2vec.bin");

Is there a similar way to import a doc2vec model?


Solution

  • The save_word2vec_format() method saves just the word-vectors, not the full model.

    If you were to use Gensim's .save() to save the full model, it'd use Python's native serialization (pickle) - so any Java code to read it would have to understand that format before rearranging the relevant properties into the DL4J objects.

    I don't see anything in the docs for DL4J's ParagraphVectors class suggesting it can read Gensim-formatted models, so I doubt there's any built-in support.

    It's theoretically possible that some Python code could be written to dump all the relevant subparts of the model in forms amenable to reading in Java, then patching them into a DL4J model, or for Java code to be written to understand the Python serialized objects – but either route would require some familiarity with both the Gensim & DL4J source code.

    (If the toJson() & fromJson() methods in DL4J work with full model representations – which isn't clear from the docs, and would be an extremely bloated format for the bulk of the model state – that'd likely make the model-translation a little easier, as it'd provide a straightforward template for what some new Python code would need to write-out.)
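
As a rough illustration of the "dump the relevant subparts from Python" idea above, here is a minimal sketch that writes a set of vectors in the plain-text word2vec layout (a header line with count and dimensionality, then one key and its values per line). The function name write_word2vec_text and the input mapping are hypothetical, invented for this example. Whether DL4J can patch such a file into a ParagraphVectors model is not confirmed by anything above; this only shows the export side.

```python
def write_word2vec_text(path, vectors):
    """Write {key: sequence of numbers} in the plain-text word2vec layout.

    Hypothetical helper for illustration; not part of Gensim or DL4J.
    """
    items = list(vectors.items())
    if not items:
        raise ValueError("no vectors to write")
    dim = len(items[0][1])
    with open(path, "w", encoding="utf-8") as out:
        # Header line: number of vectors, then their dimensionality.
        out.write(f"{len(items)} {dim}\n")
        for key, vec in items:
            key = str(key)
            if len(vec) != dim:
                raise ValueError(f"vector for {key!r} has the wrong size")
            if any(ch.isspace() for ch in key):
                # Keys are whitespace-delimited in this layout.
                raise ValueError(f"key {key!r} contains whitespace")
            values = " ".join(repr(float(x)) for x in vec)
            out.write(f"{key} {values}\n")
```

Assuming Gensim 4.x, where trained document vectors are exposed as model.dv, the mapping could probably be built with something like {tag: model.dv[tag] for tag in model.dv.index_to_key}. Note that this covers only the per-document vectors learned during training; inferring vectors for unseen text needs the rest of the model's weights, which this sketch does not export.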