I have a dataframe with a boolean column and I want to fill the missing values with False.
However, when I use the fillna method, nothing happens:
df = spark.createDataFrame([(True,), (True,), (None,), (None,)], ['col'])
df.fillna(False).show()
The output is
+----+
| col|
+----+
|true|
|true|
|null|
|null|
+----+
But when I do it manually, the values are filled in:
from pyspark.sql import functions as fn
df.withColumn("col", fn.when(fn.col("col").isNull(), False).otherwise(fn.col("col"))).show()
+-----+
|  col|
+-----+
| true|
| true|
|false|
|false|
+-----+
Does anyone know why this happens and how to fix it?
Support for Boolean values in fillna was introduced in Spark 2.3.0. I suspect you're using an older version of Spark, which does not support Boolean fillna yet and therefore leaves the Boolean column unchanged.
Compare the fillna docs for Spark 2.2.0 and Spark 2.3.0 to see the difference.
To fix this, either upgrade to Spark 2.3.0 or later, or keep using the when/otherwise expression from your question. Another option is coalesce, which returns the first non-null value among its arguments, e.g.
import pyspark.sql.functions as F
df2 = df.withColumn("col", F.coalesce(F.col("col"), F.lit(False)))
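If the same code has to run on clusters with different Spark versions, the two approaches above can be combined. The following is only a rough sketch: the helper names supports_bool_fillna and fill_bool_nulls are made up for illustration, and it assumes the version string looks like "2.3.0" (which is what spark.version normally returns).

```python
def supports_bool_fillna(spark_version):
    """Return True if this Spark version accepts Boolean values in fillna (2.3.0+)."""
    # Only the major and minor components matter, e.g. "2.3.0" -> (2, 3).
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (2, 3)


def fill_bool_nulls(df, column, spark_version):
    """Replace nulls in one Boolean column with False (hypothetical helper)."""
    if supports_bool_fillna(spark_version):
        # Spark 2.3.0+: fillna handles Boolean values; subset limits it to one column.
        return df.fillna(False, subset=[column])
    # Older Spark: fall back to coalesce, which works regardless of fillna support.
    from pyspark.sql import functions as F
    return df.withColumn(column, F.coalesce(F.col(column), F.lit(False)))
```

With the dataframe from the question, this would be called as fill_bool_nulls(df, "col", spark.version).show(). On an older Spark it takes the coalesce branch, so the result should match your manual when/otherwise output.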