I want to perform a join based on Levenshtein distance. I have 2 tables:

Data: a CSV file in HDFS; one of its columns is Disease description (15K rows).
df7_ct_map: a table I read from Hive; one of its columns is Disease Indication (20K rows).

I'm trying to join both tables by matching each description with the indication (they are text descriptions of sicknesses). Ideally they should be identical, but when the two texts differ I want to select the match containing the maximum number of common words.
from pyspark.sql.functions import levenshtein

joinedDF = df7_ct_map.join(Data, levenshtein(df7_ct_map("description"),
                                             Data("Indication")) < 3)
joinedDF.show(10)
The problem is that Data is a DataFrame, which is why I obtain the following error:
TypeError: 'DataFrame' object is not callable
TypeError Traceback (most recent call last)
in engine
----> 1 joinedDF = df7_ct_map.join( Data, levenshtein(df7_ct_map("description"), Data("Indication")) < 3)
TypeError: 'DataFrame' object is not callable
Any advice? Can I use the Fuzzywuzzy package? If so, how?
Instead of joining with the call syntax df7_ct_map("description") (which only works in Scala), index the columns with brackets:

newDF = df1.join(df2, levenshtein(df1['description'], df2['Indication']) < 3)

This allows a difference of at most 2 characters when joining the two DataFrames. Hope this helps.
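For the "maximum number of common words" requirement, a Levenshtein threshold may be too strict. Below is a minimal plain-Python sketch of that idea; it uses the standard-library difflib as a stand-in for Fuzzywuzzy's fuzz.ratio, and the helper names (common_word_count, best_match) are hypothetical, not from any library. In Spark you could wrap the scoring logic in a UDF, cross-join the two tables (feasible at 15K x 20K rows only with care), and keep the highest-scoring indication per description.

```python
from difflib import SequenceMatcher

def common_word_count(a, b):
    # Number of distinct words the two texts share (case-insensitive).
    return len(set(a.lower().split()) & set(b.lower().split()))

def similarity(a, b):
    # 0..100 similarity score, comparable in spirit to fuzzywuzzy's fuzz.ratio.
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))

def best_match(description, indications):
    # Pick the indication sharing the most words; break ties with the ratio.
    return max(indications,
               key=lambda ind: (common_word_count(description, ind),
                                similarity(description, ind)))

indications = ["chronic heart failure", "acute kidney failure", "type 2 diabetes"]
print(best_match("heart failure chronic", indications))  # -> chronic heart failure
```

With Fuzzywuzzy installed, similarity could instead return fuzz.ratio(a, b) (or fuzz.token_sort_ratio, which ignores word order) with no other changes.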