Tags: nlp, nltk, gensim, word2vec, word-embedding

What is the difference between wmd (word mover distance) and wmd based similarity?


I am using WMD to calculate the similarity scale between sentences. For example:

distance = model.wmdistance(sentence_obama, sentence_president)

Reference: https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html

However, there is also WMD based similarity method (WmdSimilarity).

Reference: https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html

What is the difference between the two, apart from the obvious fact that one is a distance and the other a similarity?

Update: Looking at the source, both rely on the same underlying wmdistance computation; they differ only in how the result is represented:

n_queries = len(query)
result = []
for qidx in range(n_queries):
    # Compute similarity for each query.
    qresult = [self.w2v_model.wmdistance(document, query[qidx]) for document in self.corpus]
    qresult = numpy.array(qresult)
    qresult = 1./(1.+qresult)  # Similarity is the negative of the distance.

    # Append single query result to list of all results.
    result.append(qresult)

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/similarities/docsim.py
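The excerpted loop can be exercised end to end by substituting a toy distance function for `wmdistance`. The `toy_wmdistance` below is a hypothetical stand-in based on token overlap, not real Word Mover's Distance; the corpus and query tokens are made-up examples in the spirit of the tutorial:

```python
import numpy as np

def toy_wmdistance(doc_a, doc_b):
    # Hypothetical stand-in for model.wmdistance: a Jaccard-style
    # distance on token sets, NOT real Word Mover's Distance.
    a, b = set(doc_a), set(doc_b)
    return 1.0 - len(a & b) / len(a | b)

corpus = [
    ["obama", "speaks", "media", "illinois"],
    ["president", "greets", "press", "chicago"],
]
query = ["obama", "speaks", "press"]

# Same two steps as in the gensim excerpt: distances, then 1 / (1 + d).
distances = np.array([toy_wmdistance(doc, query) for doc in corpus])
similarities = 1.0 / (1.0 + distances)

print(distances)
print(similarities)
```

Swapping `toy_wmdistance` for a real `model.wmdistance` recovers exactly what `WmdSimilarity` computes per query.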


Solution

  • I think with the 'update' you've more-or-less answered your own question.

    That one is a distance, and the other a similarity, is the only difference between the two calculations. As the notebook you link notes in the relevant section:

    WMD is a measure of distance. The similarities in WmdSimilarity are simply the negative distance. Be careful not to confuse distances and similarities. Two similar documents will have a high similarity score and a small distance; two very different documents will have low similarity score, and a large distance.

    As the code you've excerpted shows, the similarity measure being used there is not exactly the 'negative' distance, but scaled so all similarity values are from 0.0 (exclusive) to 1.0 (inclusive). (That is, a zero distance becomes a 1.0 similarity, but ever-larger distances become ever-closer to 0.0.)
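    To make that scaling concrete, here is a minimal sketch of the 1/(1+d) mapping (the function name is ours, not gensim's):

    ```python
    def wmd_to_similarity(distance):
        # Maps a non-negative distance d to 1 / (1 + d), as in WmdSimilarity:
        # d = 0 gives similarity 1.0; larger d approaches, but never
        # reaches, 0.0.
        return 1.0 / (1.0 + distance)

    for d in [0.0, 0.5, 2.0, 100.0]:
        print(d, wmd_to_similarity(d))
    ```

    So a raw distance from `model.wmdistance` and a score from `WmdSimilarity` always agree on the ranking of documents; only the scale differs.
    
    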