python python-3.x pandas dataframe fuzzywuzzy

Deduplication with FuzzyWuzzy creating new columns

I've been trying to find duplicates for all cells in a column called "text" that at least match 90% and only keep the first row in case of duplicates found (and remove the rest of the duplicate rows). This should then be displayed in a new csv file.

I have tried to do so with this MWE, however it seems to create 2 new columns called "Matches" and "Combined" that I don't need, as a new csv without the duplicates and with only the first occurence would be the eventual goal.

import pandas as pd
from dedupe_FuzzyWuzzy import deduplication

df = pd.read_csv('/path/input.csv')
# normal duplication drop
df = df.drop_duplicates(subset='text', keep='first')

# threshold drop
df_final = deduplication.deduplication(df, ['text'],threshold=90)

# send output to csv
df_final.to_csv('/path/deduplicated.csv',index=False)

Solution

This code, with a basic example, uses rapidfuzz to mark fuzzy-matched duplications in a text column of a pandas DataFrame. Note: higher threshold means more severe matching. The code goes through a List of text values from the column, checks for fuzzy-duplication and marks for deletion. A deletion list is then used as a mask to remove selected DataFrame rows.

import pandas as pd
import rapidfuzz
txt = ['abc', 'abcdx', 'xyx', 'abcdef', 'xyxg','abcde', 'abxdx', 'xyxk', 'abcdex', '1234', 'abxdx2', '12345']
vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

df = pd.DataFrame({'col1': txt, 'col2': vals})


def mark_dupes(txt_col, threshold = 90):

    txt_list = txt_col.to_list()
    marked = [True]*len(txt_list)
    
    for i, txt in enumerate(txt_list):
        if not marked[i]:      #don't check duplicates of text rows marked for removal
            continue
        for j, others in enumerate(txt_list[i+1::]): 
            if marked[i+j+1]:     # only look through vals not already marked for removal
                if rapidfuzz.fuzz.ratio(txt, others, score_cutoff = threshold):
                    marked[i+j+1] = False   # mark for removal
    return marked

chk = mark_dupes(df['col1'], threshold = 80)

dfx = df[chk]
print(dfx)

which prints

     col1  col2
0     abc     1
2     xyx     3
3  abcdef     4
6   abxdx     7
9    1234    10