Search code examples
pysparktranslateregexp-replace

Pyspark removing multiple characters in a dataframe column


Looking at pyspark, I see translate and regexp_replace to help me a single characters that exists in a dataframe column.

I was wondering if there is a way to supply multiple strings in the regexp_replace or translate so that it would parse them and replace them with something else.

Use case: remove all $, #, and comma(,) in a column A


Solution

  • You can use pyspark.sql.functions.translate() to make multiple replacements. Pass in a string of letters to replace and another string of equal length which represents the replacement values.

    For example, let's say you had the following DataFrame:

    import pyspark.sql.functions as f
    df = sqlCtx.createDataFrame([("$100,00",),("#foobar",),("foo, bar, #, and $",)], ["A"])
    df.show()
    #+------------------+
    #|                 A|
    #+------------------+
    #|           $100,00|
    #|           #foobar|
    #|foo, bar, #, and $|
    #+------------------+
    

    and wanted to replace ('$', '#', ',') with ('X', 'Y', 'Z'). Simply use translate like:

    df.select("A", f.translate(f.col("A"), "$#,", "XYZ").alias("replaced")).show()
    #+------------------+------------------+
    #|                 A|          replaced|
    #+------------------+------------------+
    #|           $100,00|           X100Z00|
    #|           #foobar|           Yfoobar|
    #|foo, bar, #, and $|fooZ barZ YZ and X|
    #+------------------+------------------+
    

    If instead you wanted to remove all instances of ('$', '#', ','), you could do this with pyspark.sql.functions.regexp_replace().

    df.select("A", f.regexp_replace(f.col("A"), "[\$#,]", "").alias("replaced")).show()
    #+------------------+-------------+
    #|                 A|     replaced|
    #+------------------+-------------+
    #|           $100,00|        10000|
    #|           #foobar|       foobar|
    #|foo, bar, #, and $|foo bar  and |
    #+------------------+-------------+
    

    The pattern "[\$#,]" means match any of the characters inside the brackets. The $ has to be escaped because it has a special meaning in regex.