python machine-learning nlp gensim doc2vec

Measure similarity between two documents using Doc2Vec


I have already trained a gensim Doc2Vec model, and it finds the documents most similar to an unknown one.

Now I need to find the similarity value between two unknown documents (which were not in the training data, so they cannot be referenced by doc id).

from gensim.models import doc2vec

d2v_model = doc2vec.Doc2Vec.load(model_file)

string1 = 'this is some random paragraph'
string2 = 'this is another random paragraph'

vec1 = d2v_model.infer_vector(string1.split())
vec2 = d2v_model.infer_vector(string2.split())

In the code above, vec1 and vec2 are successfully initialized to some values, and each has size 'vector_size'.

Looking through the gensim API and examples, I could not find a method that works for me; all of them expect a TaggedDocument.

Can I compare the feature vectors value by value, and if they are closer, does that mean the texts are more similar?


Solution

  • In case someone is interested: to do this, you just need the cosine distance between the two vectors.

    I found that most people use scipy's 'spatial' module for this purpose.

    Here is a small code snippet that should work pretty well if you already have a trained doc2vec model:

    from gensim.models import doc2vec
    from scipy import spatial
    
    d2v_model = doc2vec.Doc2Vec.load(model_file)
    
    first_text = '..'
    second_text = '..'
    
    # infer_vector expects a list of tokens, not a raw string
    vec1 = d2v_model.infer_vector(first_text.split())
    vec2 = d2v_model.infer_vector(second_text.split())
    
    cos_distance = spatial.distance.cosine(vec1, vec2)
    # cos_distance is 1 minus the cosine similarity, so it indicates how much
    # the two texts differ from each other:
    # higher values mean more distant (i.e. different) texts
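
Since the question asks for a similarity value rather than a distance, here is a minimal sketch of the conversion. scipy's cosine distance is defined as 1 minus the cosine similarity, so the similarity can be recovered as 1 - cos_distance, or computed directly as below. The helper name cosine_similarity and the example vectors are made up for illustration; in practice the inputs would be the vectors returned by infer_vector.

```python
import math


def cosine_similarity(vec1, vec2):
    """Return the cosine similarity (between -1 and 1) of two equal-length vectors."""
    if len(vec1) != len(vec2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


# Stand-in vectors for illustration only (assumption: real inputs would be
# the outputs of d2v_model.infer_vector(...)).
example_vec1 = [1.0, 2.0, 3.0]
example_vec2 = [2.0, 4.0, 6.0]     # same direction as example_vec1
example_vec3 = [-1.0, -2.0, -3.0]  # opposite direction to example_vec1

similarity_same = cosine_similarity(example_vec1, example_vec2)
similarity_opposite = cosine_similarity(example_vec1, example_vec3)
```

Values near 1 mean the vectors point in nearly the same direction (more similar texts), values near 0 mean they are unrelated, and negative values mean they point in opposing directions. Comparing the vectors component by component is not needed; the angle between them is what this measure captures.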