gensim · doc2vec

Assessing doc2vec accuracy


I am trying to assess a doc2vec model based on the code from here. Basically, I want to know the percentage of inferred documents that are found to be most similar to themselves. This is my current code:

    import collections

    docs = 0
    ranks = []

    for doc_id, doc in enumerate(cur.execute('SELECT Text FROM Patents')):
        docs += 1
        doc_words = clean_text(doc[0])  # each row is a 1-tuple, so take the text column
        inferred_vector = model.infer_vector(doc_words)
        # rank every trained doc-vector by similarity to the re-inferred vector
        sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
        # position at which the document finds itself in its own ranking
        rank = [docid for docid, sim in sims].index(doc_id)
        ranks.append(rank)

    counter = collections.Counter(ranks)
    accuracy = counter[0] / docs  # fraction of documents most similar to themselves

This code works perfectly with smaller datasets. However, since I have a huge file with millions of documents, this code becomes too slow; it would take months to compute. I profiled my code, and most of the time is consumed by the following line: sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs)).

If I am not mistaken, this is having to measure each document against every other document. I think computation time might be massively reduced if I changed this to topn=1 instead, since the only thing I want to know is whether the most similar document is the document itself or not. Doing this would take each doc (i.e., its inferred_vector), find its single most similar document (i.e., topn=1), and then just check whether that is itself or not. How could I implement this? Any help or idea is welcome.


Solution

  • To have most_similar() return only the single most-similar document, that is as simple as specifying topn=1.

    However, to know which one document of the millions is the most-similar to a single target vector, the similarities to all the candidates must be calculated & sorted. (If even one document was left out, it might have been the top-ranked one!)
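
    If you do only want that single top hit, a minimal sketch of that variant of your loop might look like the following (assuming the same model, cur, and clean_text() as in your code, and that documents were trained with their 0-based row positions as integer tags). Note that it still computes a similarity against every stored doc-vector for each document; it only skips ranking the full list:

    hits = 0
    docs = 0
    for doc_id, doc in enumerate(cur.execute('SELECT Text FROM Patents')):
        docs += 1
        inferred_vector = model.infer_vector(clean_text(doc[0]))
        # topn=1 returns a 1-element list of (doc_tag, similarity)
        top_tag, top_sim = model.dv.most_similar([inferred_vector], topn=1)[0]
        if top_tag == doc_id:
            hits += 1
    self_match_rate = hits / docs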

    Making sure absolutely no virtual-memory swapping is happening will help ensure that brute-force comparison happens as fast as possible, all in RAM – but with millions of docs, it will still be time-consuming.

    What you're attempting is a fairly simple "self-check" as to whether training led to a self-consistent model: whether the re-inference of a document creates a vector "very close to" the same doc-vector left over from bulk training. Failing that will indicate some big problems in doc-prep or training, but it's not a true measure of the model's "accuracy" for any real task, and the model's value is best evaluated against your intended use.

    Also, because this "re-inference self-check" is just a crude sanity check, there's no real need to do it for every document. Picking a thousand (or ten thousand, or whatever) random documents will give you a representative idea of whether most of the re-inferred vectors have this quality, or not.
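
    For example, a rough sketch of spot-checking a random sample only (assuming 0-based integer tags, and a hypothetical fetch_text(doc_id) helper that returns the raw text for a given document ID from your database):

    import random

    sample_ids = random.sample(range(len(model.dv)), 1000)

    ranks = []
    for doc_id in sample_ids:
        doc_words = clean_text(fetch_text(doc_id))  # fetch_text() is your own lookup
        inferred_vector = model.infer_vector(doc_words)
        sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
        ranks.append([tag for tag, sim in sims].index(doc_id))

    self_match_rate = sum(1 for r in ranks if r == 0) / len(ranks)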

    Similarly, you could simply check the similarity of the re-inferred vector against the single in-model vector for that same document-ID, and check whether they are "similar enough". (This will be much faster, but could also be done on just a random sample of docs.) There's no magic proper threshold for "similar enough"; you'd have to pick one that seems to match your other goals. For example, using scikit-learn's cosine_similarity() to compare the two vectors:

    from sklearn.metrics.pairwise import cosine_similarity
    
    # ...
    
        inferred_vector = model.infer_vector(doc_words)
        looked_up_vector = model.dv[doc_id]
        self_similarity = cosine_similarity([inferred_vector], [looked_up_vector])[0]
        # then check that value against some threshold
    

    (You have to wrap the single vectors in lists as arguments to cosine_similarity(), then access the 0th element of the return value, because it is designed to usually work on larger lists of vectors.)
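
    Combining that with the random-sampling idea above, a rough sketch of reporting what share of a sample clears some chosen threshold (the 0.8 below is purely illustrative, and fetch_text() is again a hypothetical lookup helper):

    threshold = 0.8  # illustrative only; pick a value that matches your goals
    passes = []
    for doc_id in sample_ids:
        doc_words = clean_text(fetch_text(doc_id))
        inferred_vector = model.infer_vector(doc_words)
        self_similarity = cosine_similarity([inferred_vector], [model.dv[doc_id]])[0][0]
        passes.append(self_similarity >= threshold)

    share_passing = sum(passes) / len(passes)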

    With this calculation, you wouldn't know if, for example, some of the other stored doc-vectors are a little closer to your inferred target, but that may not be that important, anyway. The docs might be really similar! And while the original "closest to itself" self-check will fail miserably if there were major defects in training, even a well-trained model will likely have some cases where natural model jitter prevents a "closest to itself" result for every document. (With more documents inside the same number of dimensions, or certain corpora with lots of very-similar documents, this would become more common... but would not be a concerning indicator of any model problems.)