I have a DataFrame with ~100 columns of various types, including StringType(), IntegerType(), BooleanType(), and ArrayType(StringType()). A few of these columns (including the boolean/array ones) contain junk values from the data provider, such as "junk", "NULL", "default", etc. I need to replace these values with NULL (an actual null, not the string "NULL").
from pyspark.sql.functions import col, when

values_to_replace = ["junk", "NULL", "default"]
replacement_value = None

for column in df.columns:
    df = df.withColumn(
        column,
        when(col(column).isin(values_to_replace), replacement_value).otherwise(col(column)),
    )
Question: The replacement above works fine for string columns, but fails for columns with non-string datatypes such as boolean/array. I'm running into the error below.
Cannot resolve "(col_name1 IN (junk, NULL, default))" due to data type mismatch: Input to in should all be the same type, but it's ["ARRAY<STRING>", "STRING", "STRING", "STRING"].;
Cannot resolve "(col_name2 IN (junk, NULL, default))" due to data type mismatch: Input to in should all be the same type, but it's ["BOOLEAN", "STRING", "STRING", "STRING"].;
How can we handle this replacement for all datatypes?
If you want to use common logic for all data types with isin, you can temporarily cast non-string columns to string just for the comparison:
from pyspark.sql.functions import when, col
from pyspark.sql.types import StringType

values_to_replace = ["junk", "NULL", "default"]
replacement_value = None

for column in df.columns:
    # Cast to string only inside the comparison; otherwise() returns the
    # original column values, so each column keeps its original data type.
    df = df.withColumn(
        column,
        when(
            col(column).cast(StringType()).isin(values_to_replace),
            replacement_value,
        ).otherwise(col(column)),
    )
No extra step is needed afterwards: the when(...).otherwise(...) expression already produces real nulls. (Note that df.na.fill() does the opposite, replacing nulls with a value, and raises an error if you pass it None.)