Databricks: Issue while creating spark data frame from pandas


I have a pandas DataFrame that I want to convert into a Spark DataFrame. I usually use the code below to do this, but it has suddenly started failing with the error shown. I am aware that pandas has removed iteritems(); my current pandas version is 2.0.0. I tried installing an older version and creating the Spark DataFrame again, but I still get the same error. The error is raised inside the Spark function. What is the solution? Which pandas version should I install in order to create the Spark DataFrame? I also tried changing the runtime of the Databricks cluster and rerunning, but I still get the same error.

import pandas as pd
spark.createDataFrame(pd.DataFrame({'i':[1,2,3],'j':[1,2,3]}))

Error:
UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  'DataFrame' object has no attribute 'iteritems'
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)
AttributeError: 'DataFrame' object has no attribute 'iteritems'

Solution

  • This is related to the Databricks Runtime (DBR) version in use. The Spark versions shipped in DBR up to 12.2 rely on the .iteritems function to construct a Spark DataFrame from a pandas DataFrame. The issue was fixed in Spark 3.4, which is available as DBR 13.x.

    If you can't upgrade to DBR 13.x, you need to downgrade pandas to the latest 1.x version (1.5.3 right now) by running %pip install -U pandas==1.5.3 in your notebook. That said, it is better to use the pandas version shipped with your DBR, since it was tested for compatibility with the other packages in that runtime.
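    As a rough illustration of the rule above, the following sketch flags the combination that triggers the error: pandas 2.0 or later together with a Spark version older than 3.4. The helper names major_minor and hits_iteritems_error are hypothetical and not part of pandas or PySpark, and the version thresholds are taken from the question and answer here rather than verified against every release.

```python
import re


def major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '3.3.2' or '2.0.0rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def hits_iteritems_error(pandas_version: str, spark_version: str) -> bool:
    """Hypothetical helper: True when the installed pair is expected to fail.

    Assumption (from the answer above): Spark releases before 3.4 call
    DataFrame.iteritems, which pandas 2.0 removed.
    """
    pandas_lacks_iteritems = major_minor(pandas_version) >= (2, 0)
    spark_calls_iteritems = major_minor(spark_version) < (3, 4)
    return pandas_lacks_iteritems and spark_calls_iteritems
```

    In a notebook this could be called as hits_iteritems_error(pd.__version__, spark.version); a True result suggests either upgrading the runtime or pinning pandas as described above.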