Tags: dataframe, pyspark, fuzzywuzzy

How to use fuzz.ratio on a DataFrame in PySpark


I want to use fuzz.ratio on a DataFrame, but I'm working in PySpark (I can't use pandas).

I import the function like this:

from fuzzywuzzy import fuzz

I create a data frame like this:

communes_corrompues = spark.createDataFrame(
    [
        ("VILLEAINTE", "VILLEPINTE"),
        ("QILLEPINTE", "VILLEPINTE"),
        ("AHIENS", "AMIENS"),
        ("AMIEPS", "AMIENS"),
        ("CVRGY", "CERGY"),
        ("CERGA", "CERGY"),
    ],
    ["corrompue", "resultat"],
)

And this statement doesn't work:

communes_corrompues_ratio = communes_corrompues.withColumn("fuzzywuzzy_ratio",
lit(fuzz.ratio(col("resultat"),col("corrompue"))))

I get this error:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Can someone help me, or does anyone know how to do this?


Solution

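  • I'd try a user-defined function (UDF) for that. fuzz.ratio is a plain Python function that expects strings, while col(...) returns a Column expression, so the call fails before Spark ever looks at any data. A UDF lets Spark call fuzz.ratio on the actual values of each row, something like: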

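    from pyspark.sql.functions import col, udf
    from fuzzywuzzy import fuzz

    # Wrap fuzz.ratio in a UDF so it receives plain strings, one row at a time
    @udf("int")
    def fuzz_udf(a, b):
        return fuzz.ratio(a, b)

    # Start from the original DataFrame and add the score as a new column
    communes_corrompues_ratio = communes_corrompues.withColumn(
        "fuzzywuzzy_ratio", fuzz_udf(col("resultat"), col("corrompue"))
    )
    communes_corrompues_ratio.show()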
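
To get a feel for what the UDF computes on each row, here is a small sketch in plain Python that runs without Spark. It assumes difflib.SequenceMatcher as a stand-in for fuzz.ratio (as far as I know, fuzzywuzzy falls back to it when python-Levenshtein isn't installed, and scores may differ slightly with python-Levenshtein). The name ratio_sketch is hypothetical and not part of either library.

```python
from difflib import SequenceMatcher


def ratio_sketch(a, b):
    """Stand-in for fuzz.ratio: similarity of two strings as an int from 0 to 100."""
    if a is None or b is None:
        return 0  # assumption: treat a missing value as "no match"
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Same pairs as in the question: (corrompue, resultat)
rows = [
    ("VILLEAINTE", "VILLEPINTE"),
    ("QILLEPINTE", "VILLEPINTE"),
    ("AHIENS", "AMIENS"),
    ("AMIEPS", "AMIENS"),
    ("CVRGY", "CERGY"),
    ("CERGA", "CERGY"),
]

# Conceptually what a UDF does: call the function once per row on plain strings
scores = [
    (corrompue, resultat, ratio_sketch(resultat, corrompue))
    for corrompue, resultat in rows
]

for row in scores:
    print(row)
```

The UDF does the same thing, except that Spark makes the per-row calls on the executors and stores the result in the new column.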