python · nlp · gensim · doc2vec

Gensim Doc2vec model: how to compute similarity on a corpus obtained using a pre-trained doc2vec model?


I have a Doc2Vec model trained on multiple documents. I would like to use that model to infer vectors for another document, which I then want to use as the corpus for comparison. So, when I look for the sentence most similar to one I introduce, the search should run over these new document vectors instead of the trained corpus. Currently, I am using infer_vector() to compute a vector for each sentence of the new document, but I can't use the most_similar() function with the list of vectors I obtain, since that function requires a KeyedVectors instance.

I would like to know whether there is a way to compute these vectors for the new document so that the most_similar() function can be used, or whether I have to compute the similarity between the sentence I introduce and each sentence of the new document individually (in that case, is there any implementation in Gensim that lets me compute the cosine similarity between two vectors?).

I am new to Gensim and NLP, and I'm open to your suggestions.

I cannot provide the complete code, since it is a university project, but here are the main parts that are giving me problems.

After doing some pre-processing of the data, this is how I train my model:

import multiprocessing

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training document is tagged with its index in the corpus
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(train_data)]
assert gensim.models.doc2vec.FAST_VERSION > -1  # make sure the optimized routines are in use

cores = multiprocessing.cpu_count()

doc2vec_model = Doc2Vec(vector_size=200, window=5, workers=cores)
doc2vec_model.build_vocab(documents)
doc2vec_model.train(documents, total_examples=doc2vec_model.corpus_count, epochs=30)

I try to compute the vectors for the new document this way:

# Note: infer_vector() expects a list of tokens, preprocessed the same way as the training data
questions = [doc2vec_model.infer_vector(line) for line in lines_4]

And then I try to compute the similarity between the new document vectors and an input phrase:

text = str(input('Me: '))

tokens = text.split()

new_vector = doc2vec_model.infer_vector(tokens)

index = questions[i].most_similar([new_vector])  # this fails: questions[i] is a plain numpy array and has no most_similar() method

Solution

  • A dirty solution I used about a month ago with gensim==3.2.0 (the syntax may have changed since).

    You can save your inferred vectors in the word2vec text format, which KeyedVectors can load.

    from gensim.models import KeyedVectors
    from gensim.models.doc2vec import Doc2Vec

    vectors = dict()
    # y_names = doc2vec_model.docvecs.doctags.keys()
    y_names = range(len(questions))

    for name in y_names:
        # vectors[str(name)] = doc2vec_model.docvecs[name]
        vectors[str(name)] = questions[name]

    # Write the vectors in word2vec text format: a header line with the
    # vocabulary size and dimensionality, then one "key v1 v2 ..." line per vector
    with open("question_vectors.txt", "w") as f:
        f.write("{} {}\n".format(len(vectors), doc2vec_model.vector_size))
        for name, vector in vectors.items():
            f.write("{} {}\n".format(name, " ".join(vector.astype(str))))
    

    Then you can load the file and use the most_similar() function:

    keyed_model = KeyedVectors.load_word2vec_format("question_vectors.txt")
    keyed_model.most_similar(str(list(y_names)[0]))
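
    If you are on a newer gensim (4.x), you can skip the file round-trip entirely and build the KeyedVectors in memory. A minimal sketch, assuming gensim >= 4.0 and the questions list and new_vector from the snippets above:

    from gensim.models import KeyedVectors

    # Build a KeyedVectors instance directly (gensim >= 4.0 API)
    kv = KeyedVectors(vector_size=doc2vec_model.vector_size)
    kv.add_vectors([str(i) for i in range(len(questions))], questions)

    # most_similar() now works against the inferred question vectors
    kv.most_similar([new_vector], topn=10)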
    

    Another solution (especially if the number of questions is not that high) would be to convert questions to a np.array and compute the cosine similarities directly, e.g.

    import numpy as np

    questions = np.array(questions)
    texts_norm = np.linalg.norm(questions, axis=1)[np.newaxis].T
    norm = texts_norm * texts_norm.T

    # Pairwise dot products divided by the norm products -> cosine similarity matrix
    product = np.matmul(questions, questions.T)
    product = product / norm

    # Zero the diagonal, otherwise each item is closest to itself
    for j in range(len(questions)):
        product[j, j] = 0

    # Indices of the 10 items most similar to the 0th question (unsorted)
    np.argpartition(product[0], -10)[-10:]
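
    To answer the original question of comparing an input phrase against the new document, the same idea works for a single query vector. A minimal sketch reusing the names from the snippets above (top_k_similar is a hypothetical helper, not a gensim function):

    import numpy as np

    def top_k_similar(query_vector, question_matrix, k=10):
        """Return indices of the k question vectors most cosine-similar to the query."""
        q = query_vector / np.linalg.norm(query_vector)
        m = question_matrix / np.linalg.norm(question_matrix, axis=1, keepdims=True)
        sims = m @ q  # cosine similarity of the query against every question
        top = np.argpartition(sims, -k)[-k:]
        return top[np.argsort(sims[top])[::-1]]  # sorted by similarity, descending

    new_vector = doc2vec_model.infer_vector(tokens)
    print(top_k_similar(new_vector, np.array(questions)))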