
Compare similarity between sets in Python


I have two sentences in Python that represent sets of words the user enters as a query for image retrieval software:

sentence1 = "dog is the"
sentence2 = "the dog is a very nice animal"

I have a set of images, each with a description, for example:

sentence3 = "the dog is running in your garden"

I want to retrieve all the images whose description is "very close" to the query entered by the user. The description score should be normalized between 0 and 1, since it is just one part of a more complex search that also considers geotagging and low-level image features.

Given that I create three sets using:

set_sentence1 = set(sentence1.split())
set_sentence2 = set(sentence2.split())
set_sentence3 = set(sentence3.split())

And compute the intersection between sets as:

intersection1 = set_sentence1.intersection(set_sentence3)
intersection2 = set_sentence2.intersection(set_sentence3)

How can I efficiently normalize the comparison?

I don't want to use Levenshtein distance, since I'm interested in set similarity, not string similarity.


Solution

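  • Maybe a metric like the one below, which divides the size of the intersection by the size of the larger set. The added 1.0 terms keep the denominator non-zero, so the result lies in (0, 1] rather than reaching exactly 0: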

    Similarity1 = (1.0 + len(intersection1))/(1.0 + max(len(set_sentence1), len(set_sentence3)))
    Similarity2 = (1.0 + len(intersection2))/(1.0 + max(len(set_sentence2), len(set_sentence3)))
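For reference, here is a minimal runnable sketch of that metric, assuming Python 3. The function names `smoothed_overlap` and `jaccard` are made up for illustration. The second function is not part of the answer above: it is the Jaccard index (intersection size over union size), a common alternative that is also bounded by 0 and 1 and does reach 0 when the sets share no words. Returning 0.0 for two empty sets is an assumption of this sketch.

```python
def smoothed_overlap(a, b):
    """Metric from the answer above: (1 + |A & B|) / (1 + max(|A|, |B|))."""
    return (1.0 + len(a & b)) / (1.0 + max(len(a), len(b)))


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| (alternative, not from the answer).

    Assumption: two empty sets are given a similarity of 0.0.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Same sentences as in the question.
query1 = set("dog is the".split())
query2 = set("the dog is a very nice animal".split())
description = set("the dog is running in your garden".split())

for name, query in (("query1", query1), ("query2", query2)):
    print(name,
          smoothed_overlap(query, description),
          round(jaccard(query, description), 3))
# query1 0.5 0.429
# query2 0.5 0.273
```

Note that the answer's metric scores both queries the same here (0.5), because it only looks at the larger of the two sets. Jaccard penalizes the extra non-matching words in the longer query, so it may or may not be the behaviour you want.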