Tags: apache-spark, distributed-computing, partition

Why hardcode repartition value


Looking at some example Spark code, I see that the number of partitions passed to repartition or coalesce is hardcoded:

val resDF = df.coalesce(16)

What is the best approach to managing this parameter, given that a hardcoded value becomes irrelevant when the cluster can be resized dynamically in a matter of seconds?


Solution

  • Well, it is common to see hardcoded values in examples, so you do not need to worry; feel free to modify them. The Partitions documentation is full of hardcoded values, but those values are just examples.

    The rule of thumb for the number of partitions is:

    you want your RDD to have roughly as many partitions as the number of executors times the number of cores used per executor, times 3 (or maybe 4). Of course, that is a heuristic, and it really depends on your application, dataset, and cluster configuration.

    However, notice that repartitioning does not come for free, so in a highly dynamic environment you have to be sure that the overhead of repartitioning is negligible compared to the gains you get from the operation.

    Also keep in mind that coalesce and repartition have different costs: coalesce merges existing partitions and avoids a full shuffle (it can only reduce the partition count), while repartition always shuffles the data, as I mention in my answer.
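
    To avoid the hardcoded value entirely, one option is to derive the partition count from the cluster's current parallelism at runtime. Here is a minimal sketch, assuming a live SparkSession named `spark` and using the ×3 multiplier from the rule of thumb above (the multiplier is a tunable assumption, not a fixed constant):

    ```scala
    // defaultParallelism typically reflects the total number of cores
    // across the cluster's current executors, so this value tracks
    // cluster resizes instead of being frozen in the code.
    val targetPartitions = spark.sparkContext.defaultParallelism * 3

    // repartition always performs a full shuffle and can increase or
    // decrease the partition count; coalesce avoids a full shuffle but
    // can only decrease it. Pick the cheaper one that fits the case.
    val resDF =
      if (targetPartitions < df.rdd.getNumPartitions) df.coalesce(targetPartitions)
      else df.repartition(targetPartitions)
    ```

    Because defaultParallelism is read when the job runs, the partition count follows the cluster's size at that moment, and the code does not need to change when the cluster is scaled.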