Tags: azure, apache-spark, pyspark, databricks, azure-databricks

UserWarning: createDataFrame attempted Arrow optimization in pyspark createDataFrame


In Azure Databricks with Runtime 12.2 LTS ML (includes Apache Spark 3.3.2, Scala 2.12) I am trying to run the following script:

import pandas as pd
example = pd.DataFrame([{'a':[{'b':'c'}]}])
from pyspark.sql.types import *
schema = StructType([
    StructField("a", ArrayType(MapType(StringType(),StringType())), True),
    ])
query_df = spark.createDataFrame(example, schema)
display(query_df)

The code execution returns the following warning:

/databricks/spark/python/pyspark/sql/pandas/conversion.py:467: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Could not convert {'b': 'c'} with type dict: was not a sequence or recognized null for conversion to list type
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.

Despite many attempts with various schema variations, I keep getting this warning.

a) Why does it try to cast the dict into a list?

b) How can I make it work correctly, and with Arrow optimization, while keeping a "list of dicts" as the data type in this column?

[EDIT] Cluster configuration: Personal Compute, 12.2 LTS ML (includes Apache Spark 3.3.2, Scala 2.12), Standard_DS3_v2.
Spark config:
spark.databricks.cluster.profile singleNode
spark.master local[*, 4]


Solution

  • The warning you are getting is caused by the data types used in your schema.

    The following data types are not supported by Arrow-based conversion:

    MapType, ArrayType of TimestampType, and nested StructType.

    So, if your schema contains MapType and you use it for the conversion, the Arrow path fails and Spark falls back to the non-optimized conversion, producing the warning shown below.

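    A minimal reproduction, using the question's own snippet, which triggers the warning and then succeeds through the non-Arrow fallback:

    import pandas as pd
    from pyspark.sql.types import *

    # column "a" holds a list of string-to-string dicts, declared as ArrayType(MapType(...))
    example = pd.DataFrame([{'a': [{'b': 'c'}]}])
    schema = StructType([
        StructField("a", ArrayType(MapType(StringType(), StringType())), True),
    ])

    # emits the "createDataFrame attempted Arrow optimization ..." UserWarning,
    # then falls back to the non-Arrow conversion
    query_df = spark.createDataFrame(example, schema)
    query_df.show(truncate=False)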

    The same thing happens for a nested StructType:

    import pandas as pd
    from pyspark.sql.types import *

    # column "a" is a struct nested inside another struct, which Arrow conversion does not support
    example = pd.DataFrame([{'a': {'b': {'c': 'd'}}}])

    schema = StructType([
        StructField("a", StructType([
            StructField("b", StructType([
                StructField("c", StringType(), nullable=True)
            ]), nullable=True)
        ]), nullable=True)
    ])

    query_df2 = spark.createDataFrame(example, schema)
    query_df2.toPandas()
    

    Warning:

    /databricks/spark/python/pyspark/sql/pandas/conversion.py:467: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
      Unable to convert the field a. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
    Direct cause: Nested StructType not supported in conversion to Arrow
    Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
      warn(msg)
    /databricks/spark/python/pyspark/sql/pandas/conversion.py:122: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
      Unable to convert the field a. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
    Direct cause: Nested StructType not supported in conversion to Arrow
    Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
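
    Both warnings point at the two configuration flags that control this behaviour. If the result of the non-Arrow fallback is acceptable and you only want Spark to stop attempting (and warning about) the Arrow path, you can switch it off for the session; note that this is a workaround for the warning, not a fix for the unsupported schema:

    # skip the Arrow code path entirely, so no optimization is attempted
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")

    # or keep Arrow enabled but raise an error instead of silently falling back
    # spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")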
    
    

    Below are the schemas for the question's example: first with the explicit MapType schema (query_df2), then with the schema Spark infers when no schema is passed (query_df1).

    # "example" and "schema" here are the question's originals (ArrayType of MapType)
    query_df1 = spark.createDataFrame(example)           # schema inferred by Spark
    query_df2 = spark.createDataFrame(example, schema)   # explicit MapType schema
    query_df2.printSchema()
    query_df1.printSchema()
    

    Output:

    root
     |-- a: array (nullable = true)
     |    |-- element: map (containsNull = true)
     |    |    |-- key: string
     |    |    |-- value: string (valueContainsNull = true)
    
    root
     |-- a: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- b: string (nullable = true)
    

    So, use a schema that is supported by the Arrow-based conversion; a sketch of such a schema for the question's data is shown below. For more information, refer to the Apache Arrow in PySpark page of the Spark documentation.
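
    For the question's data, where the column is a list of dicts, one supported-schema sketch (assuming the dict keys are fixed, here just "b", so a struct can stand in for the map) is to declare the column as an array of structs, matching the schema Spark inferred above. Depending on the runtime and PyArrow version the array-of-struct case may still fall back, but it avoids the unsupported MapType:

    import pandas as pd
    from pyspark.sql.types import *

    example = pd.DataFrame([{'a': [{'b': 'c'}]}])

    # array of structs instead of array of maps; each dict key becomes a struct field
    schema = StructType([
        StructField("a", ArrayType(StructType([
            StructField("b", StringType(), True)
        ])), True),
    ])

    query_df = spark.createDataFrame(example, schema)
    query_df.printSchema()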