I want to apply fuzz.ratio to two columns of a data frame, but I'm working in PySpark (I can't use pandas).
These are my imports:
from pyspark.sql.functions import col, lit
from fuzzywuzzy import fuzz
I create a data frame like this:
communes_corrompues = spark.createDataFrame(
    [
        ("VILLEAINTE", "VILLEPINTE"),
        ("QILLEPINTE", "VILLEPINTE"),
        ("AHIENS", "AMIENS"),
        ("AMIEPS", "AMIENS"),
        ("CVRGY", "CERGY"),
        ("CERGA", "CERGY"),
    ],
    ["corrompue", "resultat"],
)
But this statement doesn't work:
communes_corrompues_ratio = communes_corrompues.withColumn("fuzzywuzzy_ratio",
lit(fuzz.ratio(col("resultat"),col("corrompue"))))
I get this error:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Can someone help me, or does anyone know how to do this?
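fuzz.ratio is an ordinary Python function that expects strings. Passing it Column objects makes Python try to evaluate a Column expression as a boolean, which raises this error. I'd try a user-defined function (UDF) for that, something like: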
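from pyspark.sql.functions import col, udf
from fuzzywuzzy import fuzz

# Wrap fuzz.ratio in a UDF so Spark calls it row by row on plain strings
@udf("int")
def fuzz_udf(a, b):
    return fuzz.ratio(a, b)

communes_corrompues_ratio = communes_corrompues.withColumn(
    "fuzzywuzzy_ratio", fuzz_udf(col("resultat"), col("corrompue"))
)
communes_corrompues_ratio.show()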
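If it helps, the per-row logic of such a UDF can be checked locally without a Spark session. The sketch below is only an illustration: `similarity` is a hypothetical helper name, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, so the scores may differ from what fuzzywuzzy reports (especially if python-Levenshtein is installed). Returning `None` for missing values is also an assumption about how you want nulls handled.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score for two strings, or None if either is missing.

    Stand-in for fuzz.ratio (assumption): based on difflib, not fuzzywuzzy.
    """
    # A UDF receives None for null cells, so guard before comparing.
    if a is None or b is None:
        return None
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Same pairs as in the question's data frame: (corrompue, resultat)
pairs = [
    ("VILLEAINTE", "VILLEPINTE"),
    ("QILLEPINTE", "VILLEPINTE"),
    ("AHIENS", "AMIENS"),
    ("AMIEPS", "AMIENS"),
    ("CVRGY", "CERGY"),
    ("CERGA", "CERGY"),
]

for corrompue, resultat in pairs:
    print(corrompue, resultat, similarity(resultat, corrompue))
```

Once the function behaves as expected on plain strings, it should be possible to register it the same way as above, for example with `udf(similarity, "int")`.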