Hi, I have a pandas DataFrame named df, where a few of the columns contain lists of strings.
id    colname    colname1
a1    []         []
a2    []         []
a3    []         ['anc', 'asf']
I want to write it into a Delta table. As per the schema of the table, the datatype of colname and colname1 is array.
But as you can see, colname doesn't contain any data, so when I try to write it into the table, it gives me this error:
AnalysisException: Found nested NullType in column 'colname' which is of ArrayType. Delta doesn't support writing NullType in complex types.
This is the code for writing it to the table:
spark_df = spark.createDataFrame(df)
spark_df.write.mode("append").option("overwriteSchema", "true").saveAsTable("dbname.tbl_name")
I tried searching everywhere but didn't find a solution.
What can I do so that even if the colname column is entirely empty (as in this case), the data is still inserted into the table successfully?
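For reference, the sample frame above can be reproduced with plain pandas (a minimal sketch; the values are copied from the table in the question):

```python
import pandas as pd

# Reconstruction of the sample frame. Note that the all-empty
# "colname" column carries no element type for Spark to infer.
df = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "colname": [[], [], []],
    "colname1": [[], [], ["anc", "asf"]],
})
```

All three columns have the generic object dtype in pandas, so Spark has to infer the element types itself when the frame is converted.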
If your column contains only empty arrays, Spark cannot tell whether it should be an array of ints, strings, or anything else, so it falls back to an array of nulls (ArrayType(NullType)), which Delta refuses to write.
Provide the schema explicitly when creating the DataFrame:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType([
    StructField("id", StringType(), True),
    StructField("colname", ArrayType(StringType()), True),
    StructField("colname1", ArrayType(StringType()), True),
])

spark_df = spark.createDataFrame(df, schema)