I'm trying to calculate the DataFrame size to determine the number of partitions for repartitioning the DataFrame while writing it to a Parquet file.
I found a post about a size estimator (here); below is the code:
def estimate_size(spark: SparkSession, input_df: DataFrame):
    input_df.cache().foreach(lambda x: x)
    df_size_in_bytes = (
        spark._jsparkSession.sessionState()
        .executePlan(input_df._jdf.queryExecution().logical())
        .optimizedPlan()
        .stats()
        .sizeInBytes()
    )
    return df_size_in_bytes
I'm using PySpark version 3.5.0. I want to understand the best way to calculate the partition size in PySpark. Any help will be appreciated.
The resolution is in the same post:
"As of this commit on Spark core, the method executionPlan asks for two parameters, the logicalPlan and mode."
You'd have to modify your function definition as below:
from pyspark.sql import SparkSession, DataFrame

def estimate_size(spark: SparkSession, input_df: DataFrame):
    # Materialize the DataFrame in the cache so the optimizer has real statistics
    input_df.cache().foreach(lambda x: x)
    # Pass both the logical plan and the query execution mode to executePlan
    df_size_in_bytes = (
        spark._jsparkSession.sessionState()
        .executePlan(input_df._jdf.queryExecution().logical(),
                     input_df._jdf.queryExecution().mode())
        .optimizedPlan()
        .stats()
        .sizeInBytes()
    )
    return df_size_in_bytes
Credit: 100chou
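For context, here is a minimal usage sketch of how the estimate could drive the repartitioning before the Parquet write. The 128 MB target per output file and the input/output paths are assumptions for illustration, not from the original post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/input")  # assumed input path

# sizeInBytes() comes back as a Scala BigInt via py4j, so convert through str
size_in_bytes = int(str(estimate_size(spark, df)))
target_bytes_per_file = 128 * 1024 * 1024  # assumed target of ~128 MB per Parquet file
num_partitions = max(1, size_in_bytes // target_bytes_per_file)

df.repartition(num_partitions).write.mode("overwrite").parquet("/data/output")  # assumed output path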