Tags: python, pyspark, apache-spark-sql, levenshtein-distance, fuzzywuzzy

PySpark levenshtein join error


I want to perform a join based on Levenshtein distance.

I have 2 tables:

  1. Data: a CSV file stored in HDFS. One of its columns is Disease description; ~15K rows.
  2. df7_ct_map: a table loaded from Hive. One of its columns is Disease Indication; ~20K rows.

I'm trying to join the two tables by matching each description with an indication (both are free-text descriptions of diseases). Ideally they are identical, but when the texts differ I want to select the matching text that shares the maximum number of common words.
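The "maximum number of common words" criterion described above can be sketched in plain Python, independent of Spark. The helper names below are my own, not part of any library:

```python
# Hypothetical helpers illustrating the "most common words" matching idea.
def common_word_count(a: str, b: str) -> int:
    # Number of distinct words shared by the two texts (case-insensitive).
    return len(set(a.lower().split()) & set(b.lower().split()))

def best_match(description: str, indications: list) -> str:
    # Pick the indication sharing the most words with the description.
    return max(indications, key=lambda ind: common_word_count(description, ind))

print(best_match(
    "chronic kidney disease",
    ["acute kidney injury", "chronic kidney disease stage 3", "asthma"],
))
# → chronic kidney disease stage 3
```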

from pyspark.sql.functions import levenshtein
joinedDF = df7_ct_map.join(
    Data, levenshtein(df7_ct_map("description"), Data("Indication")) < 3)
joinedDF.show(10)

The problem is that Data is a DataFrame, and calling it like a function raises the following error:

TypeError                                 Traceback (most recent call last)
in engine
----> 1 joinedDF = df7_ct_map.join( Data, levenshtein(df7_ct_map("description"), Data("Indication")) < 3)

TypeError: 'DataFrame' object is not callable

Any advice? Can I use the fuzzywuzzy package? If so, how?


Solution

  • Instead of calling the DataFrames like functions, reference the columns with bracket syntax:

    newDF=df1.join(df2,levenshtein(df1['description'], df2['description']) < 3)
    

    This will allow a difference of at most 2 characters when joining the two data frames.

    Hope this helps.