Search code examples
pythonpython-3.xpandasdataframefuzzywuzzy

Deduplication with FuzzyWuzzy creating new columns


I've been trying to find duplicates for all cells in a column called "text" that at least match 90% and only keep the first row in case of duplicates found (and remove the rest of the duplicate rows). This should then be displayed in a new csv file.

I have tried to do so with this MWE, however it seems to create 2 new columns called "Matches" and "Combined" that I don't need, as a new csv without the duplicates and with only the first occurence would be the eventual goal.

import pandas as pd
from dedupe_FuzzyWuzzy import deduplication

df = pd.read_csv('/path/input.csv')
# normal duplication drop
df = df.drop_duplicates(subset='text', keep='first')

# threshold drop
df_final = deduplication.deduplication(df, ['text'],threshold=90)

# send output to csv
df_final.to_csv('/path/deduplicated.csv',index=False)

Solution

  • This code, with a basic example, uses rapidfuzz to mark fuzzy-matched duplications in a text column of a pandas DataFrame. Note: higher threshold means more severe matching. The code goes through a List of text values from the column, checks for fuzzy-duplication and marks for deletion. A deletion list is then used as a mask to remove selected DataFrame rows.

    import pandas as pd
    import rapidfuzz
    txt = ['abc', 'abcdx', 'xyx', 'abcdef', 'xyxg','abcde', 'abxdx', 'xyxk', 'abcdex', '1234', 'abxdx2', '12345']
    vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
    
    df = pd.DataFrame({'col1': txt, 'col2': vals})
    
    
    def mark_dupes(txt_col, threshold = 90):
    
        txt_list = txt_col.to_list()
        marked = [True]*len(txt_list)
        
        for i, txt in enumerate(txt_list):
            if not marked[i]:      #don't check duplicates of text rows marked for removal
                continue
            for j, others in enumerate(txt_list[i+1::]): 
                if marked[i+j+1]:     # only look through vals not already marked for removal
                    if rapidfuzz.fuzz.ratio(txt, others, score_cutoff = threshold):
                        marked[i+j+1] = False   # mark for removal
        return marked
    
    chk = mark_dupes(df['col1'], threshold = 80)
    
    dfx = df[chk]
    print(dfx)
    

    which prints

         col1  col2
    0     abc     1
    2     xyx     3
    3  abcdef     4
    6   abxdx     7
    9    1234    10