apache-spark · pyspark

Replace set of values in a column with NULL


I have a DataFrame with ~100 columns of various types, including StringType(), IntegerType(), BooleanType(), and ArrayType(StringType()). A few of these columns (including the boolean/array ones) contain junk values from the data provider, such as junk, NULL, default, etc. I need to replace these values with a real NULL, not the string "NULL".

from pyspark.sql.functions import when, col

values_to_replace = ["junk", "NULL", "default"]
replacement_value = None

for column in df.columns:
    df = df.withColumn(
        column,
        when(col(column).isin(values_to_replace), replacement_value).otherwise(col(column)),
    )

Question: The replacement above works fine for string columns, but it fails for columns with non-string data types such as boolean/array, with the errors below:

Cannot resolve "(col_name1 IN (junk, NULL, default))" due to data type mismatch: Input to in should all be the same type, but it's ["ARRAY", "STRING", "STRING", "STRING"].;

Cannot resolve "(col_name2 IN (junk, NULL, default))" due to data type mismatch: Input to in should all be the same type, but it's ["BOOLEAN", "STRING", "STRING", "STRING"].;

How can this replacement be handled for all data types?


Solution

  • If you want to use one common piece of logic for all data types with isin, you can cast each column to string just for the comparison; the otherwise() branch still returns the original column, so the column keeps its original type.

    from pyspark.sql.functions import when, col
    from pyspark.sql.types import StringType

    values_to_replace = ["junk", "NULL", "default"]
    replacement_value = None

    for column in df.columns:
        # Compare on the string representation so isin works for any type;
        # the otherwise() branch preserves the column's original value and type.
        df = df.withColumn(
            column,
            when(
                col(column).cast(StringType()).isin(values_to_replace),
                replacement_value,
            ).otherwise(col(column)),
        )

    Note: a trailing df.na.fill(replacement_value) is not needed here (and df.na.fill(None) actually raises an error, since fillna expects an int, float, string, bool, or dict). The when/otherwise expression already produces real nulls.