python, pandas, apache-spark, pyspark, databricks

Anyone know how to display a pandas dataframe in Databricks?


Previously I had a pandas dataframe that I could display as a table in Databricks using:

df.display()

Pandas was updated to v2.0.0 today and I am now getting the following error when I run df.display():

AttributeError: 'DataFrame' object has no attribute 'iteritems'

Anyone know how I can resolve this?

I tried running df.display (without parentheses) and it gives an output, but I am looking for output in tabular form.


Solution

  • As a workaround, downgrade to pandas v1.5

    %pip install --upgrade pandas==1.5
    
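    If pandas 2.0.0 has already been imported in the current notebook session, the downgrade may not take effect until the notebook's Python process is restarted. A minimal sketch, assuming a recent Databricks Runtime where dbutils.library.restartPython() is available:

    # Run in a new cell after the %pip install above: restart this notebook's
    # Python process so the downgraded pandas is the one that gets imported.
    dbutils.library.restartPython()

    # In the next cell, confirm the active version before retrying df.display().
    import pandas as pd
    print(pd.__version__)  # should print 1.5.x
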

    The answers posted so far worked prior to 3 April 2023, before pandas 2.0.0 was released.

    As of 4 April, with pandas 2.0.0, you can no longer convert a pandas DataFrame to a Spark DataFrame using:

    spark.createDataFrame(df)
    

    Using the above command leads to the error mentioned in the question:

    AttributeError: 'DataFrame' object has no attribute 'iteritems'
    

    The iteritems method was removed in pandas 2.0.0. From the pandas 2.0.0 changelog:

    Removed deprecated Series.iteritems(), DataFrame.iteritems(), use obj.items instead
    
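    In your own code the fix is a one-line rename. A small, self-contained illustration (the DataFrame below is made up for the example):

    import pandas as pd

    pdf = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

    # pandas < 2.0:  for col, series in pdf.iteritems(): ...
    # pandas >= 2.0: iteritems() is gone; items() yields the same (column, Series) pairs.
    for col, series in pdf.items():
        print(col, series.tolist())
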

    Meanwhile, the Spark code that converts a pandas DataFrame to a Spark DataFrame still calls iteritems:

    /databricks/spark/python/pyspark/sql/pandas/conversion.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
        308                     warnings.warn(msg)
        309                     raise
    --> 310         data = self._convert_from_pandas(data, schema, timezone)
        311         return self._create_dataframe(data, schema, samplingRatio, verifySchema)
        312 
    
    /databricks/spark/python/pyspark/sql/pandas/conversion.py in _convert_from_pandas(self, pdf, schema, timezone)
        340                             pdf[field.name] = s
        341             else:
    --> 342                 for column, series in pdf.iteritems():
        343                     s = _check_series_convert_timestamps_tz_local(series, timezone)
        344                     if s is not series:
    

    It looks like we will have to wait for a fix before pandas 2.0.0 can be used here.
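
    If downgrading is not an option, another stopgap sometimes used (not part of this answer, and it patches pandas globally for the session, so treat it as a sketch) is to alias the removed methods back to items() before converting, since Spark's conversion path only needs the old name. Here spark and df are assumed to be the notebook's SparkSession and the pandas DataFrame from the question:

    import pandas as pd

    # pandas 2.0 removed DataFrame.iteritems() and Series.iteritems();
    # items() returns the same (label, value) pairs, so re-point the old names at it.
    if not hasattr(pd.DataFrame, "iteritems"):
        pd.DataFrame.iteritems = pd.DataFrame.items
    if not hasattr(pd.Series, "iteritems"):
        pd.Series.iteritems = pd.Series.items

    sdf = spark.createDataFrame(df)  # no longer raises AttributeError
    sdf.display()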