apache-spark, pyspark

Error while calculating pyspark dataframe size


I'm trying to calculate the size of a DataFrame so I can determine how many partitions to repartition it into before writing it to a Parquet file.

I found a post about a size estimator (here); below is the code:

from pyspark.sql import DataFrame, SparkSession


def estimate_size(spark: SparkSession, input_df: DataFrame):
    # Materialize the DataFrame in the cache so the plan statistics are populated.
    input_df.cache().foreach(lambda x: x)
    df_size_in_bytes = (
        spark._jsparkSession.sessionState()
        .executePlan(input_df._jdf.queryExecution().logical())
        .optimizedPlan()
        .stats()
        .sizeInBytes()
    )

    return df_size_in_bytes

But I'm getting the error below: [error screenshot]

I'm using PySpark version 3.5.0. I want to understand the best way to calculate the size of a DataFrame's partitions in PySpark. Any help would be appreciated.


Solution

  • The resolution is in the same post:

    "As of this commit on Spark core, the method executionPlan asks for two parameters, the logicalPlan and mode."

    You'd have to modify your function definition as below:

    def estimate_size(spark: SparkSession, input_df: DataFrame):
        # Materialize the DataFrame in the cache so the plan statistics are populated.
        input_df.cache().foreach(lambda x: x)
        df_size_in_bytes = (
            spark._jsparkSession.sessionState()
            # executePlan now takes both the logical plan and the execution mode.
            .executePlan(input_df._jdf.queryExecution().logical(),
                         input_df._jdf.queryExecution().mode())
            .optimizedPlan()
            .stats()
            .sizeInBytes()
        )

        return df_size_in_bytes
    

    Credit: 100chou
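
  • To tie this back to the original goal of choosing a partition count before
    writing to Parquet, you can divide the estimated size by a target partition
    size. This is a minimal sketch, not part of the linked post; the ~128 MB
    target and the output path are assumptions to adjust for your data and
    cluster:

    import math

    # sizeInBytes() comes back through Py4J as a Scala BigInt, so convert it
    # to a Python int via its string form.
    size_in_bytes = int(str(estimate_size(spark, input_df)))

    # Assumed target of roughly 128 MB per output file; tune for your workload.
    target_partition_bytes = 128 * 1024 * 1024
    num_partitions = max(1, math.ceil(size_in_bytes / target_partition_bytes))

    (
        input_df.repartition(num_partitions)
        .write.mode("overwrite")
        .parquet("/tmp/output.parquet")  # assumed output path
    )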