Tags: scala, apache-spark, rdd, data-partitioning

In Spark, when no partitioner is specified, does the reduceByKey operation repartition the data by hash before aggregating it?


If we do not specify a partitioner for a reduceByKey operation, does it perform hash partitioning internally before the reduction? For example, my test code is:

val rdd = sc.parallelize(Seq((5, 1), (10, 2), (15, 3), (5, 4), (5, 1), (5, 3), (5, 9), (5, 6)), 5)
val newRdd = rdd.reduceByKey((a, b) => a + b)

Here, does the reduceByKey operation bring all records with the same key to the same partition and then perform the reduction (when no partitioner is specified, as in the code above)? Since my use case has skewed data (all records share the same key), it could cause an out-of-memory error if all records are brought to one partition, so a uniform distribution of the records over all the partitions suits the use case.
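For reference, a quick way to check the first part of this is to inspect the result's partitioner and the contents of each partition. This is a minimal sketch, assuming the same local SparkContext sc and the rdd/newRdd variables from the snippet above (the printed partitioner string is illustrative):

// With no partitioner argument, reduceByKey uses a HashPartitioner with the
// parent RDD's number of partitions (5 here).
println(newRdd.partitioner)  // e.g. Some(org.apache.spark.HashPartitioner@...)

// glom() turns each partition into an array, so we can see where keys land:
// all the (5, _) records end up as a single reduced (5, sum) record on one partition.
newRdd.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"partition $i: ${part.mkString(", ")}")
}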


Solution

  • In fact, the great advantage of using reduceByKey instead of groupByKey is that Spark combines values for keys that sit on the same partition before the shuffle (that is, before repartitioning anything), so each partition contributes at most one record per key to the shuffle. It is therefore very unlikely to run into memory issues because of skewed data when using reduceByKey; the sketch after the quote below illustrates the difference.

    For more details, you may want to read this Databricks post comparing reduceByKey and groupByKey. In particular, it says:

    While both of these functions will produce the correct answer, the reduceByKey example works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
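
    To make the map-side combine concrete, here is a minimal sketch contrasting the two operations on skewed data; the million-record RDD and the variable names are illustrative, assuming the same SparkContext sc as above:

    // Skewed input: every record shares the key 1, spread over 8 partitions.
    // Long values avoid integer overflow when the sums are combined.
    val skewed = sc.parallelize(1L to 1000000L, 8).map(i => (1, i))

    // reduceByKey: each partition first collapses its own records for key 1
    // into one partial sum, so only 8 small records cross the network.
    val sums = skewed.reduceByKey(_ + _)

    // groupByKey: all 1,000,000 values for key 1 are shuffled to a single
    // partition and materialized there, which is the skew/OOM risk.
    val grouped = skewed.groupByKey()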