
Understanding Gensim Doc2vec ranking


I'm using gensim 4.0.1 and following tutorials 1 and 2:

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

texts = [t.lower().split() for t in texts]

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(texts)]
model = Doc2Vec(documents, epochs=50, vector_size=5, window=2, min_count=2, workers=4)

new_vector = model.infer_vector("human machine interface".split())


for rank, (doc_id, score) in enumerate(model.dv.most_similar_cosmul(positive=[new_vector]), 1):
    print('{}. {:.5f} [{}] {}'.format(rank, score, doc_id, ' '.join(documents[doc_id].words)))


1. 0.56613 [7] graph minors iv widths of trees and well quasi ordering
2. 0.55941 [6] the intersection graph of paths in trees
3. 0.55061 [2] the eps user interface management system
4. 0.54981 [1] a survey of user opinion of computer system response time
5. 0.52249 [4] relation of user perceived response time to error measurement
6. 0.52240 [8] graph minors a survey
7. 0.49214 [0] human machine interface for lab abc computer applications
8. 0.49016 [3] system and human system engineering testing of eps
9. 0.47899 [5] the generation of random binary unordered trees

Why does document[0], which contains "human machine interface", get such a low ranking (position 7)? Is this a result of semantic generalization, or does the model need tuning? Is there a tutorial with a larger corpus that would give repeatable results?


Solution

  • The problem is the same as in my prior answer to a similar question:

    https://stackoverflow.com/a/66976706/130288

    Doc2Vec needs far more data to start working. Nine texts, with maybe 55 total words and perhaps around half that many unique words, are far too small to show any interesting results with this algorithm.

    A few of Gensim's Doc2Vec-specific test cases & tutorials manage to squeeze some vaguely understandable similarities out of a test dataset (from a file lee_background.cor) that has 300 documents, each a few hundred words long - so tens of thousands of words, several thousand of which are unique. But even then you need to reduce the dimensionality & increase the epochs, and the results are still very weak.
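
    For a rough idea of what those examples do, here is a sketch of training on gensim's bundled copy of lee_background.cor (the vector_size/epochs values below are illustrative, not the tutorial's exact settings):

        import gensim
        from gensim.test.utils import datapath
        from gensim.models.doc2vec import Doc2Vec, TaggedDocument

        # gensim ships a copy of the Lee corpus: one short news document per line
        lee_file = datapath('lee_background.cor')

        def read_corpus(fname):
            with open(fname, encoding='iso-8859-1') as f:
                for i, line in enumerate(f):
                    tokens = gensim.utils.simple_preprocess(line)
                    yield TaggedDocument(tokens, [i])

        train_corpus = list(read_corpus(lee_file))

        # keep the dimensionality low & the epochs high, because the corpus is still small
        model = Doc2Vec(vector_size=50, min_count=2, epochs=40)
        model.build_vocab(train_corpus)
        model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)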

    If you want to see meaningful results from Doc2Vec, you should be aiming for tens of thousands of documents, ideally with each document having dozens or hundreds of words.

    Everything short of that is going to be disappointing and not representative of the sorts of tasks the algorithm was designed for.
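
    As a minimal sketch of the shape of that kind of training run - the parameter values and the way you load your own corpus are only illustrative assumptions, not recommended settings:

        from gensim.models.doc2vec import Doc2Vec, TaggedDocument

        def train_doc2vec(tokenized_docs):
            """tokenized_docs: an iterable of token lists - ideally tens of thousands
            of documents, each dozens-to-hundreds of words long."""
            documents = [TaggedDocument(tokens, [i]) for i, tokens in enumerate(tokenized_docs)]
            # Illustrative parameters only: more data justifies a larger vector_size,
            # and a higher min_count discards words too rare to train well.
            return Doc2Vec(documents, vector_size=100, window=5,
                           min_count=5, epochs=20, workers=4)

        # Usage (my_tokenized_docs is a placeholder for your own preprocessed corpus):
        # model = train_doc2vec(my_tokenized_docs)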

    There's a tutorial using a larger movie-review dataset (100K documents, also used in the original 'Paragraph Vector' paper) at:

    https://radimrehurek.com/gensim/auto_examples/howtos/run_doc2vec_imdb.html#sphx-glr-auto-examples-howtos-run-doc2vec-imdb-py

    There's a tutorial based on Wikipedia (millions of documents) that might need some fixup to work nowadays at:

    https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
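
    Very loosely, that notebook streams TaggedDocuments out of a Wikipedia dump via gensim's WikiCorpus - something like the sketch below, where the dump filename and all parameter values are placeholders and the details may differ from the current notebook:

        from gensim.corpora.wikicorpus import WikiCorpus
        from gensim.models.doc2vec import Doc2Vec, TaggedDocument

        # placeholder path to a locally downloaded Wikipedia dump
        wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
        wiki.metadata = True  # make get_texts() also yield (page_id, title)

        class TaggedWikiCorpus:
            def __init__(self, wiki_corpus):
                self.wiki_corpus = wiki_corpus
            def __iter__(self):
                for tokens, (page_id, title) in self.wiki_corpus.get_texts():
                    yield TaggedDocument(tokens, [title])

        documents = TaggedWikiCorpus(wiki)
        # placeholder parameters; training on all of Wikipedia takes many hours
        model = Doc2Vec(dm=0, dbow_words=1, vector_size=200, min_count=20, epochs=10, workers=8)
        model.build_vocab(documents)
        model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)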