
How to handle escape characters in PySpark: trying to replace an escape character with NULL


I'm trying to replace an escape character with NULL in a PySpark DataFrame. The data in the DataFrame looks like this:

Col1|Col2|Col3 
1|\026\026|026|abcd026efg. 

Col2 is garbage data that I'm trying to replace with NULL. I tried the replace and regexp_replace functions to replace '\026' with a null value, but because of the escape character (" \ "), the data is not replaced with NULL.

 replace(col2, "026",  'abcd') 
 replace(Col2, "\026",  'abcd') 
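One likely cause of the failed attempts above is that the backslash starts an escape sequence in a regex, so a pattern like "\026" does not match the literal text \026. The escaping can be checked outside Spark with Python's re module (a sketch assuming Col2 holds the literal characters backslash-0-2-6; in PySpark's regexp_replace, which uses Java regex, the same doubled backslash applies):

```python
import re

# Assumed sample value from Col2: the literal text \026 repeated twice.
value = r"\026\026"

# A literal backslash must itself be escaped as "\\" in the pattern.
pattern = r"\\026"

cleaned = re.sub(pattern, "", value)
print(repr(cleaned))  # -> ''
```

An unescaped backslash in the pattern would either raise an error or match something else entirely, which matches the behavior described in the question.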

Finally,

I want my data to look like this:

Col1|Col2|Col3 
1|NULL|026|abcd026efg. 

Any thoughts on how to resolve this would be highly appreciated.

Thanks -EVR



Solution

  • Replace each run of digits together with the non-digit character preceding it, then convert the resulting empty strings to null:

     import pyspark.sql.functions as F

     # Remove each digit run and the non-digit before it. regexp_replace
     # cannot take None as the replacement, so blank the value first,
     # then map the blanks to null.
     cleaned = F.regexp_replace('col2', r'\D\d+', '')
     df.withColumn('col2', F.when(cleaned == '', None).otherwise(cleaned)).show()
    
    +----+----+-----------+
    |col1|col2|       col3|
    +----+----+-----------+
    |   1|null|abcd026efg.|
    +----+----+-----------+
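The pattern itself can be verified outside Spark with Python's re module (\D and \d behave the same in Java and Python regex); a minimal sketch using the sample values from the question:

```python
import re

# The pattern from the answer: one non-digit followed by a run of digits.
pattern = r"\D\d+"

# Col2's garbage collapses to an empty string, which the `when` step
# above then maps to null.
print(re.sub(pattern, "", r"\026\026"))    # -> ''

# Note the pattern is not specific to backslashes: applied to a value
# like Col3's, it would also strip "d026" - hence only col2 is rewritten.
print(re.sub(pattern, "", "abcd026efg."))  # -> 'abcefg.'
```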