apache-spark · pyspark

Replace set of values in a column with NULL


I have a DataFrame with ~100 columns of various types, including StringType(), IntegerType(), BooleanType(), and ArrayType(StringType()). A few of these columns (including the boolean/array ones) contain junk values from the data provider, such as junk, NULL, default, etc. I need to replace these values with a real NULL, not the string "NULL".

from pyspark.sql.functions import when, col

values_to_replace = ["junk", "NULL", "default"]
replacement_value = None

for column in df.columns:
    df = df.withColumn(
        column,
        when(col(column).isin(values_to_replace), replacement_value).otherwise(col(column)),
    )

Question: The replacement above works fine for string columns, but it fails for columns with non-string data types such as boolean/array, with the errors below:

Cannot resolve "(col_name1 IN (junk, NULL, default))" due to data type mismatch: Input to in should all be the same type, but it's ["ARRAY", "STRING", "STRING", "STRING"].;

Cannot resolve "(col_name2 IN (junk, NULL, default))" due to data type mismatch: Input to in should all be the same type, but it's ["BOOLEAN", "STRING", "STRING", "STRING"].;

How can this replacement be handled for all data types?


Solution

  • If you want to use one common piece of logic for all data types with isin, you can cast each column to string just for the comparison; the otherwise() branch still returns the original column, so the column keeps its original type.

    from pyspark.sql.functions import when, col
    from pyspark.sql.types import StringType

    values_to_replace = ["junk", "NULL", "default"]
    replacement_value = None

    for column in df.columns:
        # Compare on the string representation so isin works for any type;
        # the otherwise() branch preserves the column's original value and type.
        df = df.withColumn(
            column,
            when(
                col(column).cast(StringType()).isin(values_to_replace),
                replacement_value,
            ).otherwise(col(column)),
        )

    Note: a trailing df.na.fill(replacement_value) is not needed here (and df.na.fill(None) actually raises an error, since fillna expects an int, float, string, bool, or dict). The when/otherwise expression already produces real nulls.