Search code examples
pythonpandasfuzzywuzzy

Create a Fuzzy Duplicate Key to Sum Rows with Fuzzy Matches (Pandas)


So I have a table where I have identified fuzzy matches and an amount. I want to be able to summarize the amount by this common key.

My Data looks like this:

Name Match1 Match2 Amount
Jame James Jim 1
Jame James Jim 2
Mike Mikes Miike 3
James Jame Jim 4
Jim Jame James 5
Mikes Mike Miike 6
Miike Mike Mikes 7
data = {'Name': {0: 'Jame',
  1: 'Jame',
  2: 'Mike',
  3: 'James',
  4: 'Jim',
  5: 'Mikes',
  6: 'Miike'},
 'Match1': {0: 'James',
  1: 'James',
  2: 'Mikes',
  3: 'Jame',
  4: 'Jame',
  5: 'Mike',
  6: 'Mike'},
 'Match2': {0: 'Jim',
  1: 'Jim',
  2: 'Miike',
  3: 'Jim',
  4: 'James',
  5: 'Miike',
  6: 'Mikes'},
 'Amount': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}}

df = pd.DataFrame.from_dict(data)

What I'd like my resulting table to look like:

Key Name Match1 Match2 Amount
Jame Jame James Jim 1
Jame Jame James Jim 2
Mike Mike Mikes Miike 3
Jame James Jame Jim 4
Jame Jim Jame James 5
Mike Mikes Mike Miike 6
Mike Miike Mike Mikes 7

So I can summarize my data like this:

Name Amount
Jame 12
Mike 16

Solution

  • If you initially have the list of keys for your dataframe, you can do this.

    mylist = ['Jame','Mike']
    
    df['Key'] = [j for i in df.to_numpy() for j in mylist if j in i]
    
    df.set_index(df.columns[-1]).reset_index(inplace=True)
    
        Key   Name Match1 Match2  Amount
    0  Jame   Jame  James    Jim       1
    1  Jame   Jame  James    Jim       2
    2  Mike   Mike  Mikes  Miike       3
    3  Jame  James   Jame    Jim       4
    4  Jame    Jim   Jame  James       5
    5  Mike  Mikes   Mike  Miike       6
    6  Mike  Miike   Mike  Mikes       7
    

    And then, you can do groupby.sum to get your desired output.

    df.groupby('Key')['Amount'].sum().reset_index()
        Key  Amount
    0  Jame      12
    1  Mike      16
    

    Update

    To obtain the key through the dataframe, you can use the process.extractOne in fuzzywuzzy to compare the matching percentage of every element in a row. If it's less than the threshold, we will update the key.

    from fuzzywuzzy import process
    
    key = [df.Name[0]]
    
    for row in df.to_numpy():
        for i in row:
            if 50 < process.extractOne(i, key)[1]:
                break
            else:
                key.append(row[0])
                break
                
    
    key
    Out[65]: ['Jame', 'Mike']