I have a piece of code where at the end, I am write dataframe to parquet file.
The logic is such that the dataframe could be empty sometimes and hence I get the below error.
df.write.format("parquet").mode("overwrite").save(somePath)
org.apache.spark.sql.AnalysisException: Parquet data source does not support null data type.;
When I print the schema of "df", I get below.
df.schema
res2: org.apache.spark.sql.types.StructType =
StructType(
StructField(rpt_date_id,IntegerType,true),
StructField(rpt_hour_no,ShortType,true),
StructField(kpi_id,IntegerType,false),
StructField(kpi_scnr_cd,StringType,false),
StructField(channel_x_id,IntegerType,false),
StructField(brand_id,ShortType,true),
StructField(kpi_value,FloatType,false),
StructField(src_lst_updt_dt,NullType,true),
StructField(etl_insrt_dt,DateType,false),
StructField(etl_updt_dt,DateType,false)
)
Is there a workaround to just write the empty file with schema, or not write the file at all when empty? Thanks
The error you are getting is not related with the fact that your dataframe is empty. I don't see the point of saving an empty dataframe but you can do it if you want. Try this if you don't believe me:
val schema = StructType(
Array(
StructField("col1",StringType,true),
StructField("col2",StringType,false)
)
)
spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
.write
.format("parquet")
.save("/tmp/test_empty_df")
You are getting that error because one of your columns is of NullType and as the exception that was thrown indicates "Parquet data source does not support null data type"
I can't know for sure why you have a column with Null type but that usually happens when you read your data from a source and let spark infer the schema. If in that source there is an empty column, spark won't be able to infer the schema and will set it to null type.
If this is what's happening, my advice is that you specify the schema on read.
If this is not the case, a possible solution is to cast all the columns of NullType to a parquet-compatible type (like StringType). Here is an example on how to do it:
//df is a dataframe with a column of NullType
val df = Seq(("abc",null)).toDF("col1", "col2")
df.printSchema
root
|-- col1: string (nullable = true)
|-- col2: null (nullable = true)
//fold left to cast all NullType to StringType
val df1 = df.columns.foldLeft(df){
(acc,cur) => {
if(df.schema(cur).dataType == NullType)
acc.withColumn(cur, col(cur).cast(StringType))
else
acc
}
}
df1.printSchema
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
Hope this helps