Tags: apache-spark, pyspark, nlp, nltk, sql-function

How to replace special characters in PySpark?


I am fairly new to PySpark, and I am trying to do some text pre-processing with it. I have two columns, Name and ZipCode, in a Spark data frame new_df. The column Name contains values like WILLY:S MALMÖ, EMPORIA, and ZipCode contains values like 123 45, which is also a string. What I want to do is remove characters like : and , from Name and remove the space inside ZipCode. I tried the following, but nothing seems to work:

from pyspark.sql import functions as sfn

new_df = new_df.withColumn('Name', sfn.regexp_replace('Name', r',', ' '))
new_df = new_df.withColumn('ZipCode', sfn.regexp_replace('ZipCode', r' ', ''))

I have tried other suggestions from SO and other websites as well, but nothing seems to work.


Solution

  • Use the character class [,:] to match , or : and replace it with a space ' ' in the Name column; for ZipCode, search for the space ' ' (or \s+) and replace it with the empty string ''. (Inside a character class | is a literal pipe, so [,|:] would also match |; [,:] is enough here.)

    Example:

    from pyspark.sql.functions import regexp_replace

    new_df.show(10, False)
    #+-----------------------+-------+
    #|Name                   |ZipCode|
    #+-----------------------+-------+
    #|WILLY:S MALMÖ, EMPORIA |123 45 |
    #+-----------------------+-------+

    # replace ',' and ':' with a space in Name, and drop the space in ZipCode
    new_df.withColumn('Name', regexp_replace('Name', r'[,:]', ' ')).\
        withColumn('ZipCode', regexp_replace('ZipCode', r' ', '')).\
        show(10, False)

    # or, using \s+ to match any whitespace in ZipCode
    new_df.withColumn('Name', regexp_replace('Name', r'[,:]', ' ')).\
        withColumn('ZipCode', regexp_replace('ZipCode', r'\s+', '')).\
        show(10, False)
    #+-----------------------+-------+
    #|Name                   |ZipCode|
    #+-----------------------+-------+
    #|WILLY S MALMÖ  EMPORIA |12345  |
    #+-----------------------+-------+
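
    The output above still has a double space inside Name (the comma becomes a space next to the existing space). If you also want to collapse repeated whitespace and trim the ends, here is a minimal follow-up sketch; this is my addition rather than part of the original answer, and the variable name cleaned is arbitrary:

    from pyspark.sql.functions import regexp_replace, trim

    # turn ',' and ':' into spaces, squeeze runs of whitespace to one space,
    # trim the ends of Name, and strip all whitespace from ZipCode
    cleaned = new_df.withColumn('Name', regexp_replace('Name', r'[,:]', ' ')).\
        withColumn('Name', trim(regexp_replace('Name', r'\s+', ' '))).\
        withColumn('ZipCode', regexp_replace('ZipCode', r'\s+', ''))
    cleaned.show(10, False)
    # Name becomes 'WILLY S MALMÖ EMPORIA' and ZipCode becomes '12345'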