word2vec, gensim, doc2vec

gensim: 'Doc2Vec' object has no attribute 'intersect_word2vec_format' when I load the Google pre-trained word2vec model


I get this error when I try to load Google's pre-trained word2vec vectors before training a Doc2Vec model on my own data. Here is part of my code:

model_dm=doc2vec.Doc2Vec(dm=1,dbow_words=1,vector_size=400,window=8,workers=4)
model_dm.build_vocab(document)
model_dm.intersect_word2vec_format('home/xxw/Downloads/GoogleNews-vectors-negative300.bin',binary=True)
model_dm.train(document)

But I got this error:

'Doc2Vec' object has no attribute 'intersect_word2vec_format'

Can you help me with this error? I downloaded the Google model from https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz, and I believe my gensim is the latest version.


Solution

  • A recent gensim refactor left Doc2Vec without the superclass that used to provide this method, which is why the attribute lookup fails. You might be able to call the method on your model_dm.wv object instead, but I'm not sure. Otherwise, if you really need that step, you could look at the gensim source and mimic its code to achieve the same effect.

    But note that Doc2Vec doesn't need word-vectors as input: it can learn everything it needs from your own training data. Whether word-vectors from elsewhere help depends on many factors. The larger or more unique your own data is, the less likely preloaded vectors are to help, or even to leave any residual effect once your own training is done.

    Other notes on your apparent setup:

    • dbow_words=1 has no effect in dm=1 mode, because that mode already trains word-vectors inherently. (It only has an effect in dm=0 DBOW mode, where it adds extra interleaved word-vector training, if you need word-vectors. Often plain DBOW, without word-vector training, is a fast and effective option.)

    • Recent versions of gensim require more arguments to train(): you must pass total_examples (or total_words) and epochs explicitly. Note also that typical published work with this algorithm uses 10-20 (or sometimes more) passes over the data, as can be specified to train() via the epochs argument, rather than the default (in some versions of gensim) of 5.