Search code examples
scalaapache-sparkpysparkgroup-bygrouping

Execution of Group By - single vs two iteration


When we perform groupBy in spark, does it bring the entire grouped data to a single partition/executor or does it perform initial grouping on whichever partition the data is present and finally bring all unique groups from each partition to a single partition for final group?

For example: If we have 100 records(just for illustration) having two fields city and country and 5 groups based on city and country i.e. each city, and country have 20 records on different partitions. When we perform group by city, country does Spark perform 1st level grouping at individual partition level and later perform the final group on a single partition by bringing all related city and country groups or does Spark perform just a single level of grouping by shuffling all related groups instead of two-level grouping?


Solution

  • Spark generally wants to minimize data movement during shuffling, so it will perform "group by" first on the partitions and then shuffle the data to bring related groups together again.

    Example: When you perform groupBy("city", "country") it will shuffle within the partitions(intra) and then inter partitions to bring together (same city and country) together in the same partition for the final movement.

    Though there are other factors this depends on.