When we perform groupBy
in spark, does it bring the entire grouped data to a single partition/executor or does it perform initial grouping on whichever partition the data is present and finally bring all unique groups from each partition to a single partition for final group?
For example: If we have 100 records(just for illustration) having two fields city
and country
and 5 groups based on city and country i.e. each city, and country have 20 records on different partitions.
When we perform group by city, country
does Spark perform 1st level grouping at individual partition level and later perform the final group on a single partition by bringing all related city and country groups or does Spark perform just a single level of grouping by shuffling all related groups instead of two-level grouping?
Spark generally wants to minimize data movement during shuffling, so it will perform "group by" first on the partitions and then shuffle the data to bring related groups together again.
Example: When you perform groupBy("city", "country")
it will shuffle within the partitions(intra) and then inter partitions to bring together (same city and country) together in the same partition for the final movement.
Though there are other factors this depends on.