Tags: regex, apache-spark, pyspark, null, regexp-replace

Why does every df value change when using Spark regexp_replace()?


I want to use regexp_replace() in PySpark to convert all question marks and backslashes in my DataFrame to null values. This is the code I used:

from pyspark.sql.functions import regexp_replace

question = "?"
empty_str = "\\\"\\\""

for column in df.columns:
    df = df.withColumn(column, regexp_replace(column, question, None))
    df = df.withColumn(column, regexp_replace(column, empty_str, None))

However, when I use this code, all the values in my DataFrame turn into null values - not just the question marks and backslashes. Is there a way I can change my code to fix this?


Solution

  • With regexp_replace you cannot replace values with null; you will need another method, e.g. replace:

    from pyspark.sql import functions as F
    df = spark.createDataFrame([("?",), ("\\",), ("b",)], ["col_name"])
    df.show()
    # +--------+
    # |col_name|
    # +--------+
    # |       ?|
    # |       \|
    # |       b|
    # +--------+
    
    pattern = r"^[?\\]+$"  # matches values made up entirely of ? and \ characters
    df = df.withColumn("col_name", F.regexp_replace("col_name", pattern, "")) \
           .replace("", None, "col_name")
    df.show()
    # +--------+
    # |col_name|
    # +--------+
    # |    null|
    # |    null|
    # |       b|
    # +--------+
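
    Since your original loop touches every column, the same fix extends to the whole DataFrame. A minimal sketch, assuming all of the columns are string-typed (non-string columns would need to be skipped or cast first):

    for column in df.columns:
        df = df.withColumn(column, F.regexp_replace(column, pattern, ""))
    df = df.replace("", None)  # no subset given, so all string columns are affected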
    

    In your attempt, every value changed to null because you passed None as the replacement argument instead of a str; only str is accepted, according to the documentation. The None reaches Spark as a null string literal, and regexp_replace returns null whenever any of its arguments is null, so every row ends up null.

    pyspark.sql.functions.regexp_replace(str: ColumnOrName, pattern: str, replacement: str) → pyspark.sql.column.Column
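
    If you would rather skip the empty-string intermediate step, a when/otherwise expression can null out the matches directly. A minimal sketch, reusing the pattern defined above:

    df = df.withColumn(
        "col_name",
        F.when(F.col("col_name").rlike(pattern), None)  # matching rows become null
         .otherwise(F.col("col_name"))
    )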