Tags: python, apache-spark, pyspark

How to handle accented letters in PySpark


I have a PySpark DataFrame in which I need to apply "translate" to a column. I have the code below:

df1 = df.withColumn("Description", F.split(F.trim(F.regexp_replace(F.regexp_replace(
        F.lower(F.col("Short_Description")), r"[/\[/\]/\{}!-]", ' '), ' +', ' ')), ' '))
        
df2 = df1.withColumn("Description", F.translate('Description', 'ãäöüẞáäčďéěíĺľňóôŕšťúůýžÄÖÜẞÁÄČĎÉĚÍĹĽŇÓÔŔŠŤÚŮÝŽ',
                                       'aaousaacdeeillnoorstuuyzAOUSAACDEEILLNOORSTUUYZ'))
                                       
df3 = df2.withColumn('Description', F.explode(F.col('Description')))

I'm getting a data type mismatch error: argument 1 requires string type, 'Description' is of array<string> type.

I need to handle the accented letters in the Description column.

Please let me know how to solve this.


Solution

  • Try using the Spark higher-order function transform to iterate through the array and apply translate to each element, as in the example below.

    Example:

    from pyspark.sql.functions import *
    
    df = spark.createDataFrame([(1, ['123a', '2431abc'])], ['id', 'description'])
    
    # transform applies translate to every element of the array, deleting 'a', 'b', 'c'
    df.withColumn("description", expr("""transform(description, x -> translate(x,'abc',''))""")).show()
    
    #result:
    #+---+-----------+
    #| id|description|
    #+---+-----------+
    #|  1|[123, 2431]|
    #+---+-----------+
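
    Applied to your case, a minimal sketch assuming df1 and the column names from your question, reusing the same accent mapping: wrap the translate call in transform so it runs on each element of the Description array before the explode.

    import pyspark.sql.functions as F

    # same character mapping as in the question
    src = 'ãäöüẞáäčďéěíĺľňóôŕšťúůýžÄÖÜẞÁÄČĎÉĚÍĹĽŇÓÔŔŠŤÚŮÝŽ'
    dst = 'aaousaacdeeillnoorstuuyzAOUSAACDEEILLNOORSTUUYZ'

    # translate each element of the Description array, then explode as before
    df2 = df1.withColumn(
        "Description",
        F.expr(f"transform(Description, x -> translate(x, '{src}', '{dst}'))"))
    df3 = df2.withColumn("Description", F.explode(F.col("Description")))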