I am using a function to compare the sentences of two dataframes and extract the sentence with the highest similarity together with its value:
- df1: containing 40,000 sentences
- df2: containing 400 sentences
Each sentence of df1 is compared against the 400 sentences of df2. The function returns a tuple that contains the sentence with the highest value and the value itself. If the returned value is below 96, it returns a tuple with two None values.
I have tried different algorithms for comparing the sentences (spacy and nltk), but I found the best results in terms of processing time by using the process.extractOne(text1, text2) method from fuzzywuzzy.process, combined with dask by partitioning df1 into 4 partitions. Inside extractOne() I am using the option score_cutoff=96 to return only values >= 96. The idea was that once a value >= 96 was found, the function would not have to iterate through the whole df2, but it seems it does not work like that.
I also tried partitioning df2 inside the function, but the processing time is not better than iterating over df2 with a list comprehension (this variant is commented out in the code).
Here's my code:
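The early-exit behavior I was hoping score_cutoff would give me looks roughly like this stand-alone sketch (difflib's SequenceMatcher is only a stand-in for the fuzzywuzzy scorer, used here so the example runs without extra dependencies):

```python
from difflib import SequenceMatcher

def first_match_sketch(text1, choices, score_cutoff=96):
    # Hoped-for behavior: stop at the FIRST choice scoring >= cutoff.
    # extractOne() does NOT work like this; it scans all choices
    # looking for the single best match.
    for choice in choices:
        score = SequenceMatcher(None, text1, choice).ratio() * 100
        if score >= score_cutoff:
            return (choice, score)
    return (None, None)

print(first_match_sketch("proyecto alpha", ["otra cosa", "proyecto alpha"]))
```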
def similitud(text1):
    a = process.extractOne(text1,
                           [df2['TITULO_PROYECTO'][i] for i in range(len(df2))],
                           score_cutoff=96)
    """
    # Alternative that partitions df2 with dask; it was not faster:
    a = process.extractOne(text1,
                           ddf2.map_partitions(lambda df2:
                               df2.apply(lambda row: row['TITULO_PROYECTO'],
                                         axis=1),
                               meta='str'
                           ).compute(scheduler='processes'),
                           score_cutoff=96)
    """
    return (None, None) if a is None else a

tupla_values = ddf1.map_partitions(lambda df1:
                   df1.progress_apply(lambda row:
                       similitud(row['TITULO_PROYECTO']),
                       axis=1),
                   meta='str'
               ).compute(scheduler='processes')
How can I reduce the processing time?
FuzzyWuzzy does not really make use of the score_cutoff parameter to stop early. It simply iterates over all elements, searches for the one with the highest score, and returns it in case that score is above your score_cutoff threshold.
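Conceptually, FuzzyWuzzy's extractOne() behaves like the following sketch (SequenceMatcher is a stand-in for the real fuzz.WRatio scorer, so the scores differ, but the control flow is the point):

```python
from difflib import SequenceMatcher

def extract_one_sketch(query, choices, score_cutoff=96):
    # Every choice is scored; score_cutoff is only checked at the
    # very end, so it never shortens the scan.
    best_choice, best_score = None, -1.0
    for choice in choices:                 # no early exit here
        score = SequenceMatcher(None, query, choice).ratio() * 100
        if score > best_score:
            best_choice, best_score = choice, score
    return (best_choice, best_score) if best_score >= score_cutoff else None

print(extract_one_sketch("hello world", ["hello world", "goodbye"]))
```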
You should try RapidFuzz (I am the author), which provides a very similar interface. One difference: the process.extractOne() method returns a tuple of (choice, score) in FuzzyWuzzy, whereas it returns a tuple of (choice, score, index) in RapidFuzz.
RapidFuzz uses fast approximations of the Levenshtein distance that are guaranteed to return a score >= the actual score of the normalized Levenshtein distance. These approximations allow it to exit early, without performing the expensive Levenshtein calculation, whenever the score is guaranteed to be below the score_cutoff. However, it cannot exit as soon as it finds the first element with a score above score_cutoff in process.extractOne() either, since it searches for the extreme, i.e. the single best match above the score_cutoff threshold (it will only exit early when it finds a match with a score of 100).
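A simplified pure-Python sketch of that strategy follows. The cheap upper bound here is derived from the length difference alone (a normalized similarity of the 2*M/(len_a+len_b) kind can never exceed 100*2*min(len_a, len_b)/(len_a+len_b)); the real implementation uses tighter bounds and optimized C++ code, so this is an illustration, not the actual algorithm:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # stand-in "expensive" scorer: normalized similarity in 0..100
    return SequenceMatcher(None, a, b).ratio() * 100

def best_match_with_bounds(query, choices, score_cutoff=96):
    best = None
    for idx, choice in enumerate(choices):
        la, lb = len(query), len(choice)
        # cheap upper bound on the score, from the lengths alone
        bound = 100.0 * (2 * min(la, lb)) / (la + lb) if la + lb else 100.0
        if bound < score_cutoff:
            continue                       # expensive scorer skipped
        score = ratio(query, choice)
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score, idx)
            if score == 100.0:             # only a perfect match stops the scan
                break
            score_cutoff = score           # raise the bar for later choices
    return best

print(best_match_with_bounds("proyecto alpha", ["x", "proyecto alpha"]))
```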
Also, you do not need to create a list of all choices inside similitud(), since both FuzzyWuzzy and RapidFuzz accept a pandas Series as well. So you could just use something like:
from rapidfuzz import process

def similitud(text1):
    a = process.extractOne(text1,
                           df2['TITULO_PROYECTO'],
                           score_cutoff=96)
    return (None, None) if a is None else (a[0], a[1])
By default process.extractOne() preprocesses your strings by lowercasing them, replacing non-alphanumeric characters like punctuation with whitespace, and stripping whitespace from the start and the end of the strings.
You could preprocess the data in df2 beforehand, so this does not have to be repeated for every element of df1 that is compared against it. Afterwards you only need to preprocess each element of df1 on each iteration:
from rapidfuzz import process, utils

def similitud(text1):
    a = process.extractOne(utils.default_process(text1),
                           df2['PROCESSED_TITULO_PROYECTO'],
                           processor=None,
                           score_cutoff=96)
    return (None, None) if a is None else (a[0], a[1])
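To illustrate what this preprocessing does, here is a rough stdlib equivalent (ASCII-only; rapidfuzz.utils.default_process also handles non-ASCII alphanumerics, so treat this only as a sketch of the idea):

```python
import re

def default_process_sketch(s):
    # lowercase, replace each non-alphanumeric character with a space,
    # then strip leading/trailing whitespace
    return re.sub(r"[^a-z0-9]", " ", s.lower()).strip()

# precompute once for the 400 choices in df2, instead of
# re-processing them for every one of the 40,000 queries
titles = ["  Proyecto: ALPHA!  ", "Beta-Proyecto 2"]
processed = [default_process_sketch(t) for t in titles]
print(processed)
```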
I recently created a benchmark that compares process.extractOne() for both the FuzzyWuzzy module (with Python-Levenshtein) and the RapidFuzz module (can be found here).
The data used in the reproducible benchmark:
- titledata.csv from FuzzyWuzzy (dataset with 2,764 titles, max choice length: 2,500)
- The hardware the above graphed benchmarks were run on (specification)