I am fairly new to Pyspark, and I am trying to do some text pre-processing with Pyspark.
I have a column Name
and ZipCode
that belongs to a spark data frame new_df
. The column 'Name' contains values like WILLY:S MALMÖ, EMPORIA
and ZipCode
contains values like 123 45
which is a string too. what I want to do is I want to remove characters like :
, ,
etc and want to remove space between the ZipCode
.
I tried the following but nothing seems to work :
new_df = new_df.withColumn('Name', sfn.regexp_replace('Name', r',' , ' '))
new_df = new_df.withColumn('ZipCode', sfn.regexp_replace('ZipCode', r' ' , ''))
I tried other things too from the SO and other websites. Nothing seems to work.
Use [,|:]
to match ,
or :
and replace with space ' '
in Name column and for zipcode search for space ' '
and replace with empty string ''
.
Example:
new_df.show(10,False)
#+-----------------------+-------+
#|Name |ZipCode|
#+-----------------------+-------+
#|WILLY:S MALMÖ, EMPORIA|123 45 |
#+-----------------------+-------+
new_df.withColumn('Name', regexp_replace('Name', r'[,|:]' , ' ')).\
withColumn('ZipCode', regexp_replace('ZipCode', r' ' , '')).\
show(10,False)
#or
new_df.withColumn('Name', regexp_replace('Name', '[,|:]' , ' ')).\
withColumn('ZipCode', regexp_replace('ZipCode', '\s+' , '')).\
show(10,False)
#+-----------------------+-------+
#|Name |ZipCode|
#+-----------------------+-------+
#|WILLY S MALMÖ EMPORIA|12345 |
#+-----------------------+-------+