I am trying to add an empty column to my DataFrame df1 in PySpark.
The code I tried:
import pyspark.sql.functions as F
df1 = df1.withColumn("empty_column", F.lit(None))
But I get this error:
pyspark.sql.utils.AnalysisException: Parquet data source does not support void data type.
Can anyone help me with this?
Instead of just F.lit(None), cast it to a proper data type, e.g.:
F.lit(None).cast('string')
F.lit(None).cast('double')
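Applied to the question's code, a minimal sketch (string is just one reasonable type choice here):

import pyspark.sql.functions as F

# Casting the null literal gives the column an explicit type,
# so its type is string rather than void:
df1 = df1.withColumn("empty_column", F.lit(None).cast('string'))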
When we add a literal null column, its data type is void:
from pyspark.sql import functions as F
spark.range(1).withColumn("empty_column", F.lit(None)).printSchema()
# root
# |-- id: long (nullable = false)
# |-- empty_column: void (nullable = true)
But when saving to a Parquet file, the void data type is not supported, so such columns must first be cast to some other data type.
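For example, a self-contained sketch of the full round trip (the output path /tmp/out.parquet is just illustrative):

from pyspark.sql import functions as F

df = spark.range(1).withColumn("empty_column", F.lit(None).cast('string'))
df.printSchema()
# root
# |-- id: long (nullable = false)
# |-- empty_column: string (nullable = true)

# Writing now succeeds because the column has a concrete type
# (path is illustrative):
df.write.mode('overwrite').parquet('/tmp/out.parquet')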