Search code examples
pythongroup-bylevenshtein-distancefuzzy-searchfuzzy-logic

How to group words whose Levenshtein distance is more than 80 percent in Python


Suppose I have a list:-

person_name = ['zakesh', 'oldman LLC', 'bikash', 'goldman LLC', 'zikash','rakesh']

I am trying to group the list in such a way so the Levenshtein distance between two strings is maximum. For finding out the ratio between two words, I am using a python package fuzzywuzzy.

Examples :-

>>> from fuzzywuzzy import fuzz
>>> combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
>>> fuzz.ratio('goldman LLC', 'oldman LLC')
95
>>> fuzz.ratio('rakesh', 'zakesh')
83
>>> fuzz.ratio('bikash', 'zikash')
83
>>> 

My end goal:

My end goal is to group the words such that Levenshtein distance between them is more than 80 percent?

My list should look something like this :-

person_name = ['bikash', 'zikash', 'rakesh', 'zakesh', 'goldman LLC', 'oldman LLC'] because the distance between `bikash` and `zikash` is very high so they should be together.

Code:

I am trying to achieve this by sorting but key function should be fuzz.ratio. Well below code is not working, but I am approaching the problem in this angle.

from fuzzywuzzy import fuzz
combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
combined_list.sort(key=lambda x, y: fuzz.ratio(x, y))
print combined_list

Could anyone help me to combine the words so that Levenshtein distance between them is more than 80 percent?


Solution

  • This groups the names

    from fuzzywuzzy import fuzz
    
    combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
    combined_list.append('bakesh')
    print('input names:', combined_list)
    
    grs = list() # groups of names with distance > 80
    for name in combined_list:
        for g in grs:
            if all(fuzz.ratio(name, w) > 80 for w in g):
                g.append(name)
                break
        else:
            grs.append([name, ])
    
    print('output groups:', grs)
    outlist = [el for g in grs for el in g]
    print('output list:', outlist)
    

    producing

    input names: ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC', 'bakesh']
    output groups: [['rakesh', 'zakesh', 'bakesh'], ['bikash', 'zikash'], ['goldman LLC', 'oldman LLC']]
    output list: ['rakesh', 'zakesh', 'bakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
    

    As you can see, the names are grouped correctly, but the order may not be the one you desire.