Tags: pyspark, aws-glue, regexp-replace

Not able to remove ( ) in pyspark


I am receiving data like 7,432,818 (Imps) and need to load it into a column of type decimal(20,3). I am trying to remove '(Imps)', but the brackets are not removed when I use regexp_replace.

I am using the code below:

validated_df=validated_df.withColumn('MeasurePer', F.regexp_replace('MeasurePer', ',', ''))
validated_df=validated_df.withColumn('MeasurePer', F.regexp_replace('MeasurePer', '(Imps)', ''))

The result I get is:

7432818 ()

Solution

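  • I think all you need is to escape the parentheses. In a regular expression, an unescaped (Imps) is a capture group that matches only the text Imps, which is why the ( ) are left behind. Escape them as \(Imps\), and use a raw string so that Python passes the backslashes through unchanged:

    validated_df=validated_df.withColumn('MeasurePer', F.regexp_replace('MeasurePer', r'\(Imps\)', ''))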
    

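    (Or)

    Do both replacements in a single call by joining the two patterns with the regex alternation operator (|):

    from pyspark.sql.functions import col, regexp_replace

    # assumes an active SparkSession named spark
    df=spark.createDataFrame([('7,432,818 (Imps)',)],['dec'])

    df=df.withColumn("dec",regexp_replace(col("dec"),r"(,|\(Imps\))",""))
    df.show()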
    
    
    +--------+
    |     dec|
    +--------+
    |7432818 |
    +--------+
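
To see why the original call left () behind, here is a small sketch that uses Python's built-in re module as a stand-in for Spark's regex engine. Spark uses Java regular expressions, but for patterns this simple the two behave the same way. The \s* prefix (which also removes the space before the bracket, visible as the trailing space in the output above) and the variable names are my own additions for illustration, not part of the answer.

```python
import re
from decimal import Decimal

raw = "7,432,818 (Imps)"
no_commas = raw.replace(",", "")

# Unescaped parentheses form a capture group, so only "Imps" is matched.
unescaped = re.sub(r"(Imps)", "", no_commas)

# Escaped parentheses match the literal characters; \s* also drops the space.
escaped = re.sub(r"\s*\(Imps\)", "", no_commas)

# One pattern that handles the commas and the suffix together.
combined = re.sub(r",|\s*\(Imps\)", "", raw)

# The cleaned string now parses as a decimal number.
value = Decimal(combined)

print(repr(unescaped))  # '7432818 ()'
print(repr(escaped))    # '7432818'
print(value)            # 7432818
```

The same pattern string, r",|\s*\(Imps\)", should be usable as the second argument of regexp_replace if you would rather not depend on how the later cast to decimal(20,3) treats the trailing space.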