I am using a function to compare the sentences of two dataframes and extract the sentence with the highest similarity together with its value:
- df1: containing 40,000 sentences
- df2: containing 400 sentences
Each sentence of df1 is compared against the 400 sentences of df2. The function returns a tuple that contains the sentence with the highest value and the value itself. If the returned value is below 96, it returns a tuple with two None values.
I have tried different algorithms for comparing the sentences (spacy and nltk), but I found the best results in terms of processing time by using the process.extractOne(text1, text2) method from fuzzywuzzy.process, combined with dask by partitioning df1 into 4 partitions. Inside extractOne() I am using the option score_cutoff=96 to return only values >= 96. The idea was that once a value >= 96 was found, the function would not have to iterate through the whole df2, but it seems it does not work like that.
I also tried partitioning df2 inside the function, but the processing time is not better than iterating over df2 with a list comprehension (this variant is commented out in the code).
Here's my code:
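The early-exit behavior I was hoping score_cutoff would give me looks roughly like this stand-alone sketch (difflib's SequenceMatcher is only a stand-in for the fuzzywuzzy scorer, used here so the example runs without extra dependencies):

```python
from difflib import SequenceMatcher

def first_match_sketch(text1, choices, score_cutoff=96):
    # Hoped-for behavior: stop at the FIRST choice scoring >= cutoff.
    # extractOne() does NOT work like this; it scans all choices
    # looking for the single best match.
    for choice in choices:
        score = SequenceMatcher(None, text1, choice).ratio() * 100
        if score >= score_cutoff:
            return (choice, score)
    return (None, None)

print(first_match_sketch("proyecto alpha", ["otra cosa", "proyecto alpha"]))
```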
def similitud(text1):
    a = process.extractOne(text1,
                           [df2['TITULO_PROYECTO'][i] for i in range(len(df2))],
                           score_cutoff=96)
    """
    # Alternative that partitions df2 with dask; it was not faster:
    a = process.extractOne(text1,
                           ddf2.map_partitions(lambda df2:
                               df2.apply(lambda row: row['TITULO_PROYECTO'],
                                         axis=1),
                               meta='str'
                           ).compute(scheduler='processes'),
                           score_cutoff=96)
    """
    return (None, None) if a is None else a

tupla_values = ddf1.map_partitions(lambda df1:
                   df1.progress_apply(lambda row:
                       similitud(row['TITULO_PROYECTO']),
                       axis=1),
                   meta='str'
               ).compute(scheduler='processes')
How can I reduce the processing time?
FuzzyWuzzy does not really make use of the score_cutoff parameter to stop early. It simply iterates over all elements, searches for the one with the highest score, and returns it in case that score is above your score_cutoff threshold.
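Conceptually, FuzzyWuzzy's extractOne() behaves like the following sketch (SequenceMatcher is a stand-in for the real fuzz.WRatio scorer, so the scores differ, but the control flow is the point):

```python
from difflib import SequenceMatcher

def extract_one_sketch(query, choices, score_cutoff=96):
    # Every choice is scored; score_cutoff is only checked at the
    # very end, so it never shortens the scan.
    best_choice, best_score = None, -1.0
    for choice in choices:                 # no early exit here
        score = SequenceMatcher(None, query, choice).ratio() * 100
        if score > best_score:
            best_choice, best_score = choice, score
    return (best_choice, best_score) if best_score >= score_cutoff else None

print(extract_one_sketch("hello world", ["hello world", "goodbye"]))
```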
You should try RapidFuzz (I am the author), which provides a very similar interface. One difference: the process.extractOne() method returns a tuple of (choice, score) in FuzzyWuzzy, whereas it returns a tuple of (choice, score, index) in RapidFuzz.
RapidFuzz uses fast approximations of the Levenshtein distance that are guaranteed to return a score >= the actual score of the normalized Levenshtein distance. These approximations allow it to exit early, without performing the expensive Levenshtein calculation, whenever the score is guaranteed to be below the score_cutoff. However, it cannot exit as soon as it finds the first element with a score above score_cutoff in process.extractOne() either, since it searches for the extreme, i.e. the single best match above the score_cutoff threshold (it will only exit early when it finds a match with a score of 100).
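A simplified pure-Python sketch of that strategy follows. The cheap upper bound here is derived from the length difference alone (a normalized similarity of the 2*M/(len_a+len_b) kind can never exceed 100*2*min(len_a, len_b)/(len_a+len_b)); the real implementation uses tighter bounds and optimized C++ code, so this is an illustration, not the actual algorithm:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # stand-in "expensive" scorer: normalized similarity in 0..100
    return SequenceMatcher(None, a, b).ratio() * 100

def best_match_with_bounds(query, choices, score_cutoff=96):
    best = None
    for idx, choice in enumerate(choices):
        la, lb = len(query), len(choice)
        # cheap upper bound on the score, from the lengths alone
        bound = 100.0 * (2 * min(la, lb)) / (la + lb) if la + lb else 100.0
        if bound < score_cutoff:
            continue                       # expensive scorer skipped
        score = ratio(query, choice)
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score, idx)
            if score == 100.0:             # only a perfect match stops the scan
                break
            score_cutoff = score           # raise the bar for later choices
    return best

print(best_match_with_bounds("proyecto alpha", ["x", "proyecto alpha"]))
```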
Also, you do not need to create a list of all choices inside similitud(), since both FuzzyWuzzy and RapidFuzz accept a pandas Series as well. So you could just use something like:
from rapidfuzz import process

def similitud(text1):
    a = process.extractOne(text1,
                           df2['TITULO_PROYECTO'],
                           score_cutoff=96)
    return (None, None) if a is None else (a[0], a[1])
By default process.extractOne() preprocesses your strings by lowercasing them, replacing non-alphanumeric characters like punctuation with whitespace, and stripping whitespace from the start and the end of the strings.
You could preprocess the data in df2 beforehand, so this does not have to be repeated for every element of df1 that is compared against it. Afterwards you only need to preprocess each element of df1 on each iteration:
from rapidfuzz import process, utils

def similitud(text1):
    a = process.extractOne(utils.default_process(text1),
                           df2['PROCESSED_TITULO_PROYECTO'],
                           processor=None,
                           score_cutoff=96)
    return (None, None) if a is None else (a[0], a[1])
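To illustrate what this preprocessing does, here is a rough stdlib equivalent (ASCII-only; rapidfuzz.utils.default_process also handles non-ASCII alphanumerics, so treat this only as a sketch of the idea):

```python
import re

def default_process_sketch(s):
    # lowercase, replace each non-alphanumeric character with a space,
    # then strip leading/trailing whitespace
    return re.sub(r"[^a-z0-9]", " ", s.lower()).strip()

# precompute once for the 400 choices in df2, instead of
# re-processing them for every one of the 40,000 queries
titles = ["  Proyecto: ALPHA!  ", "Beta-Proyecto 2"]
processed = [default_process_sketch(t) for t in titles]
print(processed)
```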
I recently created a benchmark that compares process.extractOne() for both the FuzzyWuzzy module (with Python-Levenshtein) and the RapidFuzz module (can be found here).
The data used in the reproducible benchmark:
- titledata.csv from FuzzyWuzzy (dataset with 2,764 titles, max choice length: 2,500)
- The hardware the above graphed benchmarks were run on (specification)