Tags: python, string, grouping, similarity, levenshtein-distance

Grouping similar strings together from a list


I've been struggling this morning with a problem at the office.

I need to find a way to group strings together from a list. It's hard to explain so here's an example:

Let's say I have a list as follow:

['MONTREAL EDUCATION BOARD', 'Île de Montréal', 'Montréal',
       'Ville de Montréal', 'MONTREAL CITY', 'Monrtéal', 'Mont-réal',
       'Toronto', 'Toronto city', 'Tornoto', 'What is this', 'Bananasplit',
       'Banana', 'StLouis', 'St-Louis', 'Saint Louis']

I need to find a way to group these values together depending on their similarity:

[['MONTREAL EDUCATION BOARD'],
 ['Île de Montréal', 'Montréal','Ville de Montréal', 'MONTREAL CITY', 'Monrtéal', 'Mont-réal'],
 ['Toronto', 'Toronto city', 'Tornoto'],
 ['What is this'],
 ['Bananasplit', 'Banana'],
 ['StLouis', 'St-Louis', 'Saint Louis']
]

That would be the perfect case. Obviously there can (and will) be errors. I need to do this with around 10,000 lists, each containing from 5 to 15,000 strings. I need to minimize the error and get the best groups I can.

I am using a slightly modified version of fuzzywuzzy: I first strip the accents and uppercase all letters for a more accurate Levenshtein distance.

What I tried is setting a threshold (let's say 80), iterating through the list, making a group out of every string, and removing duplicate elements. Obviously this isn't the result I need, because every element must appear in only one list, and similarity is not transitive (A can be linked to B, and B to C, without A being linked to C).

    groups = []
    for curr in lst:
        curr_grp = []
        for item in lst:
            # partial_ratio from my accent-stripped, uppercased fuzzywuzzy wrapper
            ratio = normalized.partial_ratio(curr, item)
            if ratio > SET_THRESHOLD:
                curr_grp.append((item, ratio))

        groups.append(curr_grp)

I think that there could be a way to find the most optimal configuration from my output:

[[('MONTREAL EDUCATION BOARD', 100),
  ('Montréal', 100), # Will probably have to use ratio() and not partial_ratio() because
  ('Monrtéal', 88),  # this can't happen, EDUCATION BOARD is NOT Montreal
  ('Mont-réal', 89)],
 [('Île de Montréal', 100),
  ('Montréal', 100),
  ('Ville de Montréal', 93),
  ('Monrtéal', 88),
  ('Mont-réal', 94)],
 [('MONTREAL EDUCATION BOARD', 100),
  ('Île de Montréal', 100),
  ('Montréal', 100),
  ('Ville de Montréal', 100),
  ('MONTREAL CITY', 100),
  ('Monrtéal', 88),
  ('Mont-réal', 88)],
 [('Île de Montréal', 93),
  ('Montréal', 100),
  ('Ville de Montréal', 100),
  ('Monrtéal', 88),
  ('Mont-réal', 94)],
 [('Montréal', 100),
  ('MONTREAL CITY', 100),
  ('Monrtéal', 88),
  ('Mont-réal', 89)],
 [('MONTREAL EDUCATION BOARD', 88),
  ('Île de Montréal', 88),
  ('Montréal', 88),
  ('Ville de Montréal', 88),
  ('MONTREAL CITY', 88),
  ('Monrtéal', 100)],
 [('MONTREAL EDUCATION BOARD', 89),
  ('Île de Montréal', 94),
  ('Montréal', 88),
  ('Ville de Montréal', 94),
  ('MONTREAL CITY', 89),
  ('Mont-réal', 100)],
 [('Toronto', 100), ('Toronto city', 100), ('Tornoto', 86)],
 [('Toronto', 100), ('Toronto city', 100), ('Tornoto', 86)],
 [('Toronto', 86), ('Toronto city', 86), ('Tornoto', 100)],
 [('What is this', 100)],
 [('Bananasplit', 100), ('Banana', 100)],
 [('Bananasplit', 100), ('Banana', 100)],
 [('StLouis', 100), ('St-Louis', 86), ('Saint Louis', 86)],
 [('StLouis', 86), ('St-Louis', 100)],
 [('StLouis', 86), ('Saint Louis', 100)]]

Could it be possible to find the most optimal subset of this list where each element appears in only one group (i.e., the one with the highest score)? Take into consideration that my lists will be WAY bigger, so I can't test every configuration because it would take years.

Otherwise, is there a more efficient way to do what I'm trying to do?

Thank you!


Solution

  • You can use a dictionary to form groups progressively with only the cities that have not yet been grouped.

    Note that I don't have fuzzywuzzy, so I created a quick-and-dirty ratio calculator to test the solution. I also removed the accented characters to make this easier (my objective was not to create a good string-comparison function).

    from collections import Counter
    stripJunk = str.maketrans("","","- ")  # drop hyphens and spaces
    def getRatio(a,b):
        a = a.lower().translate(stripJunk)
        b = b.lower().translate(stripJunk)
        total  = len(a)+len(b)
        # characters that remain after cancelling out the common ones
        counts = (Counter(a)-Counter(b))+(Counter(b)-Counter(a))
        return 100 - 100 * sum(counts.values()) / total
    

    Here is the grouping logic (you can replace my custom getRatio() function with the one from fuzzywuzzy):

    data = ['MONTREAL EDUCATION BOARD', 'Ile de Montreal', 'Montreal',
           'Ville de Montreal', 'MONTREAL CITY', 'Monrteal', 'Mont-real',
           'Toronto', 'Toronto city', 'Tornoto', 'What is this', 'Bananasplit',
           'Banana', 'StLouis', 'St Louis', 'Saint Louis']
    
    threshold    = 75
    minGroupSize = 1
    
    from itertools import combinations
    
    # start by pairing every city with itself
    paired = { c:{c} for c in data }
    for a,b in combinations(data,2):
        if getRatio(a,b) < threshold: continue
        paired[a].add(b)
        paired[b].add(a)
    
    groups    = list()
    ungrouped = set(data)
    while ungrouped:
        bestGroup = {}
        for city in ungrouped:
            # candidate group: cities similar to `city` that are all pairwise similar
            g = paired[city] & ungrouped
            for c in g.copy():
                g &= paired[c]
            if len(g) > len(bestGroup):
                bestGroup = g
        if len(bestGroup) < minGroupSize: break  # to terminate grouping early, change minGroupSize to 3
        ungrouped -= bestGroup
        groups.append(bestGroup)
    

    The groups variable is a list that will contain sets of city names (the groups). Cities will only appear in one group.

    # With a threshold of 75%:
    {'MONTREAL CITY', 'Montreal', 'Monrteal', 'Mont-real'}
    {'St Louis', 'StLouis', 'Saint Louis'}
    {'Toronto', 'Toronto city', 'Tornoto'}
    {'Ville de Montreal', 'Ile de Montreal'}
    {'MONTREAL EDUCATION BOARD'}
    {'Bananasplit'}
    {'Banana'}
    {'What is this'}
    

    With a lower threshold (or a better comparison function), you will get fewer groups:

    # With a threshold of 65%:
    {'Monrteal', 'Montreal', 'Ville de Montreal', 'MONTREAL CITY', 'Mont-real', 'Ile de Montreal'}
    {'Toronto', 'Toronto city', 'Tornoto'}
    {'Saint Louis', 'StLouis', 'St Louis'}
    {'Banana', 'Bananasplit'}
    {'What is this'}
    {'MONTREAL EDUCATION BOARD'}
    

    From a performance perspective, this produces results in reasonable time for relatively small data sets: it took 83 seconds to group 1,600 cities. Because of the O(N^2) nature of the combinations() loop, it will probably become impractical when you get to 15,000 items in the list.

    The grouping loop starts with the larger groups. It takes up about half of the processing time. You could probably save some of that time by stopping it once you reach a small enough group — that is, if you don't need a bazillion 1-2 city groups. I tried stopping the grouping loop when the group size becomes less than 3, and the 1,600 cities were processed in 48 seconds (a significant saving for my simulated data). You may not get this much of a performance boost with actual data, though.
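If the O(N^2) pairing step itself becomes the bottleneck at 15,000 items, one common mitigation (a sketch, not part of the original answer) is blocking: bucket the strings by a cheap normalized key and only run the ratio comparison within each bucket. The `normalize()` helper and the first-letter key below are illustrative choices; note the trade-off that a typo in the first letter would put a string in the wrong bucket, so that pair would never be compared.

```python
from collections import defaultdict
from itertools import combinations

stripJunk = str.maketrans("", "", "- ")

def normalize(s):
    # same normalization as the ratio function: lowercase, drop "-" and " "
    return s.lower().translate(stripJunk)

def candidate_pairs(data):
    # Blocking: only strings sharing a bucket key are ever compared,
    # replacing the full N*(N-1)/2 sweep with smaller per-bucket sweeps.
    buckets = defaultdict(list)
    for s in data:
        buckets[normalize(s)[:1]].append(s)  # key: first normalized letter
    for bucket in buckets.values():
        yield from combinations(bucket, 2)

data = ['Toronto', 'Tornoto', 'Montreal', 'Mont-real', 'StLouis', 'Saint Louis']
pairs = list(candidate_pairs(data))
# Likely matches such as 'Toronto'/'Tornoto' are still compared, while
# cross-bucket pairs like 'Toronto'/'Montreal' are skipped entirely.
```

The `paired` dictionary would then be built from `candidate_pairs(data)` instead of `combinations(data, 2)`; the grouping loop itself is unchanged.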