Tags: python, nlp, gensim, word2vec

Is the Gensim word2vec model the same as the standard model by Mikolov?


I am implementing the method from a paper to compare its performance with ours. In the paper, the author says:

300-dimensional pre-trained word2vec vectors (Mikolov et al., 2013)

I am wondering whether the pretrained word2vec Gensim model here is the same as the pretrained embeddings on the official Google site (the GoogleNews-vectors-negative300.bin.gz file).


My doubt arises from this line in the Gensim documentation (in the Word2Vec Demo section):

We will fetch the Word2Vec model trained on part of the Google News dataset, covering approximately 3 million words and phrases

Does this mean the model in Gensim is not fully trained? Is it different from the official embeddings by Mikolov?
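
For reference, the model I mean is the one the demo fetches through gensim's downloader API, presumably something like the following (a minimal sketch; the dataset name 'word2vec-google-news-300' is the one used in the current gensim docs):

```python
import gensim.downloader as api

# Fetch the pre-trained Google News vectors referenced in the gensim demo
# (a large download on first use; returns a KeyedVectors object).
wv = api.load('word2vec-google-news-300')

print(wv['king'].shape)                 # (300,) -- 300-dimensional vectors
print(wv.most_similar('king', topn=3))  # nearest neighbours by cosine similarity
```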


Solution

  • That demo code for reading word-vectors downloads the exact same Google-trained GoogleNews-vectors-negative300 set of vectors; a spot-check comparing the two loading routes is sketched below. (No one else can re-create those vectors, because the original corpus of news articles used, over 100B words of training data from around 2013 if I recall correctly, is internal to Google.)

    Algorithmically, the gensim Word2Vec implementation was closely modeled on the word2vec.c code released by Google/Mikolov, so it should match its results in measurable respects for any newly-trained vectors; a rough mapping from word2vec.c flags to gensim parameters is sketched after this answer. (Slight differences in threading approaches may cause small run-to-run variations.)
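
If you want to verify the first point yourself, one quick spot-check (a sketch, assuming you have downloaded the official GoogleNews-vectors-negative300.bin.gz into the working directory) is to load the vectors both ways and compare them:

```python
import numpy as np
import gensim.downloader as api
from gensim.models import KeyedVectors

# Vectors fetched via gensim's downloader (the route taken by the demo code).
wv_gensim = api.load('word2vec-google-news-300')

# Vectors loaded directly from Google's original release file.
wv_google = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

# The per-word vectors should be identical for any word you sample.
for word in ['king', 'queen', 'Paris']:
    assert np.allclose(wv_gensim[word], wv_google[word])
print('Sampled vectors match')
```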
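
For newly-trained vectors, the correspondence between word2vec.c flags and gensim's Word2Vec parameters looks roughly like this (a sketch assuming gensim 4.x parameter names; my_corpus is a placeholder for your own iterable of tokenized sentences):

```python
from gensim.models import Word2Vec

# Placeholder corpus: an iterable of token lists (substitute your own data).
my_corpus = [['the', 'quick', 'brown', 'fox'],
             ['jumps', 'over', 'the', 'lazy', 'dog']]

# Parameters chosen to mirror common word2vec.c flags:
#   -size 300 -window 5 -negative 5 -sample 1e-3 -cbow 0
model = Word2Vec(
    sentences=my_corpus,
    vector_size=300,   # -size
    window=5,          # -window
    negative=5,        # -negative
    sample=1e-3,       # -sample
    sg=1,              # skip-gram, i.e. -cbow 0
    min_count=1,       # word2vec.c defaults to -min-count 5; 1 keeps this toy corpus usable
    workers=4,         # threading is the main source of small run-to-run differences
)
print(model.wv['fox'].shape)   # (300,)
```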