
pyspark.sql.utils.AnalysisException: Parquet data source does not support void data type


I am trying to add a column in my dataframe df1 in PySpark.

The code I tried:

import pyspark.sql.functions as F
df1 = df1.withColumn("empty_column", F.lit(None))

But I get this error:

pyspark.sql.utils.AnalysisException: Parquet data source does not support void data type.

Can anyone help me with this?


Solution

  • Instead of a bare F.lit(None), cast the literal to a concrete data type. E.g.:

    F.lit(None).cast('string')
    
    F.lit(None).cast('double')
    

    When we add a literal null column, its data type is void:

    from pyspark.sql import functions as F
    spark.range(1).withColumn("empty_column", F.lit(None)).printSchema()
    # root
    #  |-- id: long (nullable = false)
    #  |-- empty_column: void (nullable = true)
    

    But the Parquet data source does not support the void data type, so such columns must be cast to some other data type before the dataframe is saved.
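
    Putting it together, a minimal sketch of the fix (the local master setting and output path are illustrative, not from the question):

    ```python
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[1]").appName("void-cast-demo").getOrCreate()

    # Cast the null literal to a concrete type so Parquet can store the column
    df = spark.range(1).withColumn("empty_column", F.lit(None).cast("string"))
    df.printSchema()
    # root
    #  |-- id: long (nullable = false)
    #  |-- empty_column: string (nullable = true)

    # Writing now succeeds, since the schema no longer contains a void column
    df.write.mode("overwrite").parquet("/tmp/void_cast_demo.parquet")
    ```

    Any Spark SQL type name accepted by cast ('string', 'double', 'int', ...) works; pick the type the column will eventually hold so downstream readers see a stable schema.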