I have a PySpark DataFrame in which I need to apply translate to a column. I have the code below:
df1 = df.withColumn("Description", F.split(F.trim(F.regexp_replace(F.regexp_replace(F.lower(F.col("Short_Description")),
    r"[/\[/\]/\{}!-]", ' '), ' +', ' ')), ' '))
df2 = df1.withColumn("Description", F.translate('Description', 'ãäöüẞáäčďéěíĺľňóôŕšťúůýžÄÖÜẞÁÄČĎÉĚÍĹĽŇÓÔŔŠŤÚŮÝŽ',
'aaousaacdeeillnoorstuuyzAOUSAACDEEILLNOORSTUUYZ'))
df3 = df2.withColumn('Description', F.explode(F.col('Description')))
I'm getting a data type mismatch error: argument 1 requires string type, 'Description' is of array<string> type.
I need to handle the accented letters in the Description column.
Please let me know how to solve this.
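translate expects a string column, but after split the Description column is an array<string>.
Try using Spark's higher-order function transform to iterate over the array and apply translate to each element.
Example: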
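from pyspark.sql.functions import expr
df = spark.createDataFrame([(1, ['123a', '2431abc'])], ['id', 'description'])
# transform applies translate to every array element; here translate deletes the characters a, b and c
df.withColumn("description", expr("""transform(description, x -> translate(x, 'abc', ''))""")).show()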
#result:
#+---+-----------+
#| id|description|
#+---+-----------+
#|  1|[123, 2431]|
#+---+-----------+