python python-3.x python-itertools fuzzy-search

How can I match two table indexes using fuzzy string metrics while preventing duplicate matches in Python?


I'm trying to combine two tables in Python, but the row titles/keys sometimes differ, so I'm using jellyfish to compare the keys and generate a list of match percentages for each possible pairing (see match_list below).
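
For reference, a list shaped like match_list could be produced roughly as follows. This is only a sketch under assumptions: it uses difflib.SequenceMatcher from the standard library as a stand-in for a jellyfish metric, and build_match_list and the sample keys are hypothetical names used for illustration.

```python
from difflib import SequenceMatcher


def build_match_list(keys_1, keys_2):
    """Score every key in keys_1 against every key in keys_2.

    Returns [[key_1, [[key_2, score], ...]], ...], the same shape as match_list.
    SequenceMatcher.ratio() is an assumed stand-in for a jellyfish similarity.
    """
    match_list = []
    for k1 in keys_1:
        scores = [[k2, SequenceMatcher(None, k1, k2).ratio()] for k2 in keys_2]
        match_list.append([k1, scores])
    return match_list


# Hypothetical keys, just to show the resulting structure.
example = build_match_list(['apple', 'pear'], ['appel', 'pear'])
print(example)
```

Any similarity function that returns a score between 0 and 1 could be swapped in for the ratio() call.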

My issue is finding the best combination of matches based on these percentages without repeating any key.

For example, let's imagine table_1 and table_2 had the matches shown in match_list below. The best match for each row, WITHOUT considering whether other rows have the same match, can be found using the code at the bottom.

table_1 = [['t1_key_1', ['t1_value_1']],
           ['t1_key_2', ['t1_value_2']],
           ['t1_key_n', ['t1_value_n']]]


table_2 = [['t2_key_1', 't2_value_1'],
           ['t2_key_2', 't2_value_2'],
           ['t2_key_n', 't2_value_n']] 

match_list = [['t1_key_1',
                    [['t2_key_1', 0.9],
                     ['t2_key_2', 0.9],
                     ['t2_key_n', 0.6]]],
              ['t1_key_2',
                   [['t2_key_1', 0.9],
                    ['t2_key_2', 0.8],
                    ['t2_key_n', 0.2]]],
              ['t1_key_n',
                   [['t2_key_1', 0.7],
                    ['t2_key_2', 0.9],
                    ['t2_key_n', 0.8]]]
            ]


result = []
for row in match_list:
    row[1].sort(key=lambda x: x[1], reverse=True)
    result.append([row[0],row[1][0]])

print(result)

returns

[['t1_key_1', ['t2_key_1', 0.9]],
 ['t1_key_2', ['t2_key_1', 0.9]], 
 ['t1_key_n', ['t2_key_2', 0.9]]]

Note that 't2_key_1' is repeated for both 't1_key_1' and 't1_key_2'; this is the repetition issue.

So my question is: how can I make the set of matches use only unique values while maximising the match percentage?

Desired output:

[['t1_key_1', ['t2_key_2', 0.9]],
 ['t1_key_2', ['t2_key_1', 0.9]], 
 ['t1_key_n', ['t2_key_n', 0.8]]]

I think it might be possible to calculate the total match value for each possible combination (where all t2_keys are unique) and then select the highest, but I couldn't work out how to generate a list of all possible unique combinations.

Also, this is the first question I've asked on here, so feel free to go hard with suggestions on ways to improve my question, explanation, etc. Apologies in advance for the terrible title, it was a struggle to word!


Solution

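  • itertools.product can produce every combination of one candidate match per row; skip the combinations that reuse a t2 key and keep the one with the highest total score.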

    import itertools
    
    table_1 = [['t1_key_1', ['t1_value_1']],
               ['t1_key_2', ['t1_value_2']],
               ['t1_key_n', ['t1_value_n']]]
    
    
    table_2 = [['t2_key_1', 't2_value_1'],
               ['t2_key_2', 't2_value_2'],
               ['t2_key_n', 't2_value_n']] 
    
    match_list = [['t1_key_1',
                        [['t2_key_1', 0.9],
                         ['t2_key_2', 0.9],
                         ['t2_key_n', 0.6]]],
                  ['t1_key_2',
                       [['t2_key_1', 0.9],
                        ['t2_key_2', 0.8],
                        ['t2_key_n', 0.2]]],
                  ['t1_key_n',
                       [['t2_key_1', 0.7],
                        ['t2_key_2', 0.9],
                        ['t2_key_n', 0.8]]]
                ]
    
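    maxscore = 0
    maxcombo = []
    # Try every way of picking one candidate match per row (three rows here).
    for a, b, c in itertools.product(*(ml[1] for ml in match_list)):
        # Skip combinations that use the same t2 key more than once.
        if a[0] == b[0] or b[0] == c[0] or a[0] == c[0]:
            continue
        score = a[1] + b[1] + c[1]
        # Keep the highest-scoring combination seen so far.
        if score > maxscore:
            maxscore = score
            maxcombo = (a, b, c)
    print(maxscore)
    # Pair each t1 key with its chosen [t2_key, score] match.
    print([[k1[0], m1] for k1, m1 in zip(match_list, maxcombo)])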
    

    Output:

    2.6
    [['t1_key_1', ['t2_key_2', 0.9]], ['t1_key_2', ['t2_key_1', 0.9]], ['t1_key_n', ['t2_key_n', 0.8]]]
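
The loop above unpacks exactly three rows (a, b, c), so it only works for a three-row match_list. A possible generalisation to any number of rows is sketched below using itertools.permutations, which only ever yields unique t2 keys, so no duplicate check is needed. The function name best_unique_matching is hypothetical, and the sketch assumes there are at least as many t2 keys as t1 keys; otherwise no assignment exists and it returns an empty list.

```python
import itertools


def best_unique_matching(match_list):
    """Return (score, pairs) for the best assignment with no repeated t2 key.

    match_list has the shape [[t1_key, [[t2_key, score], ...]], ...].
    """
    t1_keys = [row[0] for row in match_list]
    # One {t2_key: score} lookup per t1 row.
    score_maps = [dict(row[1]) for row in match_list]
    # All distinct t2 keys, in first-seen order.
    t2_keys = list(dict.fromkeys(k for m in score_maps for k in m))

    best_score, best_pairs = float('-inf'), []
    # Each permutation assigns a different t2 key to every t1 row.
    for perm in itertools.permutations(t2_keys, len(t1_keys)):
        # Ignore assignments that use a pairing with no score.
        if any(k2 not in m for k2, m in zip(perm, score_maps)):
            continue
        score = sum(m[k2] for k2, m in zip(perm, score_maps))
        if score > best_score:
            best_score = score
            best_pairs = [[k1, [k2, m[k2]]]
                          for k1, k2, m in zip(t1_keys, perm, score_maps)]
    return best_score, best_pairs


match_list = [
    ['t1_key_1', [['t2_key_1', 0.9], ['t2_key_2', 0.9], ['t2_key_n', 0.6]]],
    ['t1_key_2', [['t2_key_1', 0.9], ['t2_key_2', 0.8], ['t2_key_n', 0.2]]],
    ['t1_key_n', [['t2_key_1', 0.7], ['t2_key_2', 0.9], ['t2_key_n', 0.8]]],
]
score, pairs = best_unique_matching(match_list)
print(score)
print(pairs)
```

Both this and the itertools.product version are brute force, and the number of permutations grows factorially with the table size. For larger tables, an assignment-problem solver such as scipy.optimize.linear_sum_assignment is the usual alternative (not shown here).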