Tags: python, pandas, apache-spark, pyspark, apache-spark-sql

Not able to write Spark DataFrame: "Found nested NullType in column 'colname' which is of ArrayType"


Hi, I have a pandas DataFrame named df, where a few of the columns contain lists of strings.

id    colname    colname1
a1    []         []
a2    []         []
a3    []         ['anc','asf']

I want to write it into a Delta table. As per the table's schema, the datatype of colname and colname1 is array.

But as you can see, colname doesn't contain any data, so when I try to write it into the table, it gives me this error:

AnalysisException: Found nested NullType in column 'colname' which is of ArrayType. Delta doesn't support writing NullType in complex types.

This is the code for writing it to the table:

spark_df = spark.createDataFrame(df)
spark_df.write.mode("append").option("overwriteSchema", "true").saveAsTable("dbname.tbl_name")

I searched everywhere but didn't find a solution.

What can I do so that even if the colname column is entirely empty (as in this case), the data is still inserted into the table successfully?


Solution

  • If your column contains only empty arrays, Spark cannot tell whether the elements would be ints, strings, or something else, so schema inference falls back to an array of nulls (array<null>) — and Delta refuses to write NullType inside complex types.

    Provide the schema explicitly when creating the DataFrame:

    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    schema = StructType([
        StructField("id", StringType(), True),
        StructField("colname", ArrayType(StringType()), True),
        StructField("colname1", ArrayType(StringType()), True),
    ])
    spark_df = spark.createDataFrame(df, schema)