Search code examples
pythonduplicatesstring-matchingfuzzy-search

Python Fuzzy matching strings in list performance


I'm checking if there are similar results (fuzzy match) in 4 same dataframe columns, and I have the following code, as an example. When I apply it to the real 40.000 rows x 4 columns dataset, keeps running in eternum. The issue is that the code is too slow. For example, if I limite the dataset to 10 users, it takes 8 minutes to compute, while for 20, 19 minutes. Is there anything I am missing? I do not know why this take that long. I expect to have all results, maximum in 2 hours or less. Any hint or help would be greatly appreciated.

from fuzzywuzzy import process
dataframecolumn = ["apple","tb"]
compare = ["adfad","apple","asple","tab"]
Ratios = [process.extract(x,compare) for x in dataframecolumn]
result = list()
for ratio in Ratios:
    for match in ratio:
        if match[1] != 100:
            result.append(match)
            break
print (result) 

Output: [('asple', 80), ('tab', 80)]


Solution

  • Major speed improvements come by writing vectorized operations and avoiding loops

    Importing necessary package

    from fuzzywuzzy import fuzz
    import pandas as pd
    import numpy as np
    

    Creating dataframe from first list

    dataframecolumn = pd.DataFrame(["apple","tb"])
    dataframecolumn.columns = ['Match']
    

    Creating dataframe from second list

    compare = pd.DataFrame(["adfad","apple","asple","tab"])
    compare.columns = ['compare']
    

    Merge - Cartesian product by introducing key(self join)

    dataframecolumn['Key'] = 1
    compare['Key'] = 1
    combined_dataframe = dataframecolumn.merge(compare,on="Key",how="left")
    combined_dataframe = combined_dataframe[~(combined_dataframe.Match==combined_dataframe.compare)]
    

    Vectorization

    def partial_match(x,y):
        return(fuzz.ratio(x,y))
    partial_match_vector = np.vectorize(partial_match)
    

    Using vectorization and getting desired result by putting threshold on score

    combined_dataframe['score']=partial_match_vector(combined_dataframe['Match'],combined_dataframe['compare'])
    combined_dataframe = combined_dataframe[combined_dataframe.score>=80]
    

    Results

    +--------+-----+--------+------+
    | Match  | Key | compare | score
    +--------+-----+--------+------+
    | apple  | 1   |   asple |    80
    |  tb    | 1   |   tab   |    80
    +--------+-----+--------+------+