apache-spark pyspark apache-spark-sql

Whether repartition() will always shuffle even before an action is triggered


I read that repartition() will be lazily evaluated as it is a transformation, and transformations are only triggered on actions.

However, I imagine that Spark must load all the data before it can repartition by a column value. In other words, my understanding is that the data is first loaded as-is, without any repartitioning or optimization, and only then does Spark repartition it. I also understand that repartition() always shuffles the data, even if it is called before any action is triggered. Is my understanding correct?

df = spark.createDataFrame(data, ["id", "name", "age"])
repartitioned_df = df.repartition("age")
... # action triggered later

Solution

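  • No. repartition() is a transformation, so calling it only records a shuffle step in the query plan. No data is read or shuffled until an action is triggered; the shuffle then runs as part of that action's job.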