Tags: dataframe, apache-spark, pyspark, databricks, azure-databricks

Can you construct pyspark.pandas.DataFrame from pyspark.sql.dataframe.DataFrame?


I am new to Spark / Databricks. My question is whether it is recommended / possible to mix SQL and pandas API DataFrames. Is it possible to create a pyspark.pandas.DataFrame directly from a pyspark.sql.dataframe.DataFrame, or do I need to re-read the parquet file?

# Suppose you have a Spark SQL DataFrame (here I read the Boston Safety Data from Microsoft Azure Open Datasets)
blob_account_name = "azureopendatastorage"
blob_container_name = "citydatacontainer"
blob_relative_path = "Safety/Release/city=Boston"
blob_sas_token = r""

wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)
print('Remote blob path: ' + wasbs_path)

df = spark.read.parquet(wasbs_path)

# Convert df to pyspark.pandas.DataFrame
df2 =   # ...?

I tried df.toPandas(), but that is no good, because it converts to a plain, undistributed pandas.core.frame.DataFrame.
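
For completeness, here is what that gives (everything is collected to the driver as plain pandas):

pdf = df.toPandas()  # collects all rows to the driver node
print(type(pdf))
# <class 'pandas.core.frame.DataFrame'>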

A workaround is to read the parquet again into a pyspark.pandas.DataFrame, which I would like to avoid.
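
That workaround would look roughly like this (re-reading the same wasbs_path through the pandas-on-Spark IO layer):

import pyspark.pandas as ps

df2 = ps.read_parquet(wasbs_path)  # reads the parquet files again instead of reusing df
print(type(df2))
# <class 'pyspark.pandas.frame.DataFrame'>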

Thanks!


Solution

  • IIUC, you are looking to convert a Spark DataFrame to a pandas-on-Spark DataFrame.

    EDIT: as per Yashash's comment, the pandas_api method is now the preferred way to do this (see the second snippet below); to_pandas_on_spark still works but has been deprecated in its favor.

    Note that the pandas API on Spark is only available since Spark 3.2, so either method requires at least that version.

    df2 = df.to_pandas_on_spark()

    print(type(df2))
    # <class 'pyspark.pandas.frame.DataFrame'>
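
    On a recent Spark version the preferred spelling is pandas_api, which does the same conversion (a minimal sketch; both methods also accept an optional index_col argument if you want to keep a column as the index):

    # pandas_api is the current name; to_pandas_on_spark is deprecated in its favor
    df2 = df.pandas_api()
    print(type(df2))
    # <class 'pyspark.pandas.frame.DataFrame'>

    # To go back to a regular Spark SQL DataFrame:
    sdf = df2.to_spark()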