apache-spark, pyspark

Error while calculating pyspark dataframe size


I'm trying to calculate the size of a DataFrame so I can determine how many partitions to repartition it into before writing it to a Parquet file.

I found a post about a size estimator (here); below is the code:

from pyspark.sql import DataFrame, SparkSession


def estimate_size(spark: SparkSession, input_df: DataFrame):
    # Materialize the DataFrame in the cache so the plan statistics are populated.
    input_df.cache().foreach(lambda x: x)
    df_size_in_bytes = (
        spark._jsparkSession.sessionState()
        .executePlan(input_df._jdf.queryExecution().logical())
        .optimizedPlan()
        .stats()
        .sizeInBytes()
    )

    return df_size_in_bytes

But I'm getting the error below: [error screenshot]

I'm using PySpark version 3.5.0. I want to understand the best way to calculate the size of a DataFrame's partitions in PySpark. Any help would be appreciated.


Solution

  • The resolution is in the same post:

    "As of this commit on Spark core, the method executionPlan asks for two parameters, the logicalPlan and mode."

    You'd have to modify your function definition as below:

    def estimate_size(spark: SparkSession, input_df: DataFrame):
        # Materialize the DataFrame in the cache so the plan statistics are populated.
        input_df.cache().foreach(lambda x: x)
        df_size_in_bytes = (
            spark._jsparkSession.sessionState()
            # executePlan now takes both the logical plan and the execution mode.
            .executePlan(input_df._jdf.queryExecution().logical(),
                         input_df._jdf.queryExecution().mode())
            .optimizedPlan()
            .stats()
            .sizeInBytes()
        )

        return df_size_in_bytes
    

    Credit: 100chou
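
  • To tie this back to the original goal of choosing a partition count before
    writing to Parquet, you can divide the estimated size by a target partition
    size. This is a minimal sketch, not part of the linked post; the ~128 MB
    target and the output path are assumptions to adjust for your data and
    cluster:

    import math

    # sizeInBytes() comes back through Py4J as a Scala BigInt, so convert it
    # to a Python int via its string form.
    size_in_bytes = int(str(estimate_size(spark, input_df)))

    # Assumed target of roughly 128 MB per output file; tune for your workload.
    target_partition_bytes = 128 * 1024 * 1024
    num_partitions = max(1, math.ceil(size_in_bytes / target_partition_bytes))

    (
        input_df.repartition(num_partitions)
        .write.mode("overwrite")
        .parquet("/tmp/output.parquet")  # assumed output path
    )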