Tags: python, comparison, string-comparison, similarity, fuzzy-comparison

Similarity score of two lists with strings


I have a list of strings as a query and a few hundred other lists of strings. I want to compare the query with each of those lists and compute a similarity score for each one.

Example:

query = ["football", "basketball", "martial arts", "baseball"]

list1 = ["apple", "football", "basketball court"]

list2 = ["ball"]

list3 = ["martial-arts", "baseball", "banana", "food", "doctor"]

What I am doing now is an exact comparison, and I am not satisfied with the results:

score = 0
for i in query:
   if i in list1:
      score += 1

score_of_list1 = score*100//len(list1)

I found a library that may help, fuzzywuzzy, but I was wondering whether you could suggest any other approach.


Solution

  • If you're looking for a way to measure similarity between strings, this SO question suggests Levenshtein distance as one method.

    Ready-made implementations exist; for example, the Natural Language Toolkit (NLTK) library provides one.

    A naive integration might look like this (I use random only as a placeholder so that the script produces output; the values themselves are meaningless):

    #!/usr/bin/env python3
    from random import random

    query = ["football", "basketball", "martial arts", "baseball"]
    lists = [["apple", "football", "basketball court"], ["ball"], ["martial-arts", "baseball", "banana", "food", "doctor"]]

    def fake_levenshtein(word1, word2):
        # Placeholder: returns a random number instead of a real distance.
        return random()

    def avg_list(values):
        # Arithmetic mean of a non-empty list of numbers.
        return sum(values) / len(values)

    for candidate in lists:
        scores = []
        # Compare every word of the candidate list with every query word.
        for w1 in candidate:
            for w2 in query:
                scores.append(fake_levenshtein(w1, w2))
        print(avg_list(scores))
    

    Good luck.
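
To make the idea above concrete, here is a minimal sketch that swaps the random placeholder for an actual Levenshtein distance written in plain Python. The function names (levenshtein, similarity, list_score) and the scoring rule are illustrative assumptions, not part of the original answer: each query word is matched to its closest word in the candidate list, and those best matches are averaged. Other aggregation rules may suit your data better.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def list_score(query, candidate):
    """Assumed scoring rule: average of each query word's best match."""
    if not query or not candidate:
        return 0.0
    best = [max(similarity(q, c) for c in candidate) for q in query]
    return sum(best) / len(best)


query = ["football", "basketball", "martial arts", "baseball"]
lists = [
    ["apple", "football", "basketball court"],
    ["ball"],
    ["martial-arts", "baseball", "banana", "food", "doctor"],
]

for candidate in lists:
    print(candidate, round(list_score(query, candidate) * 100))
```

With this rule, near matches such as "martial arts" and "martial-arts" (one edit apart) contribute almost a full point instead of zero, which is the main difference from the exact comparison in the question.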