
Gensim Doc2Vec model returns different cosine similarity depending on the dataset


I trained two Doc2Vec models on two datasets.

The first dataset contained 2,400 documents; the second contained 3,000 documents, including all of the documents used in the first dataset.

For example:

dataset 1 = doc1, doc2, ... doc2400

dataset 2 = doc1, doc2, ... doc2400, doc2401, ... doc3000

I expected both Doc2Vec models to return the same cosine similarity between doc1 and doc2; however, they returned different scores.

Does a Doc2Vec model's result change with the dataset, even when the datasets share the same documents?


Solution

  • Yes, any addition to the training set will change the relative results.

    Further, as explained in the Gensim FAQ, even re-training with the exact same data will typically result in different end coordinates for each training doc, though each run should be about equivalently useful:

    https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q11-ive-trained-my-word2vec--doc2vec--etc-model-repeatedly-using-the-exact-same-text-corpus-but-the-vectors-are-different-each-time-is-there-a-bug-or-have-i-made-a-mistake-2vec-training-non-determinism

    What should remain roughly the same between runs is the neighborhoods around each document. That is, adding some extra training docs shouldn't change the general result that some candidate doc is "very close" or "closer than other docs" to some target doc - except to the extent that (1) the new docs might include some even-closer docs; and (2) a small amount of 'jitter' between runs, per the FAQ answer above.

    If in fact you see lots of change in the relative neighborhoods and top-N neighbors of a document, either in repeated runs or runs with small increments of extra data, there's possibly something else wrong in the training.

    In particular, 2400 docs is a pretty small dataset for Doc2Vec - smaller datasets might need a smaller vector_size and/or more epochs and/or other tweaks to get reliable results, and even then may not show off the strengths of this algorithm, which are more apparent on larger datasets (tens of thousands to millions of docs).