Tags: apache-spark, pyspark, rdd

Why doesn't Spark's repartition balance data across partitions?


>>> rdd = sc.parallelize(range(10), 2)
>>> rdd.glom().collect()
[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
>>> rdd.repartition(3).glom().collect()
[[], [0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
>>>

Why is the first partition empty? I'd appreciate an explanation.


Solution

  • That happens because Spark doesn't shuffle individual elements but rather blocks of data, with a minimum batch size of 10.

    So if a partition holds fewer elements than that, Spark won't split up that partition's contents when repartitioning.
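The effect of batch-level shuffling can be illustrated with a small pure-Python sketch (no Spark required). The `BATCH_MIN` constant and the round-robin placement below are illustrative assumptions, not Spark's actual shuffle logic; the point is only that moving whole batches, rather than elements, leaves a partition empty when there are fewer batches than target partitions.

```python
BATCH_MIN = 10  # stands in for the minimum batch size mentioned above

def make_batches(partition, batch_size=BATCH_MIN):
    """Group a partition's elements into batches, mimicking batched serialization."""
    return [partition[i:i + batch_size]
            for i in range(0, len(partition), batch_size)]

def repartition_batches(partitions, num_partitions):
    """Distribute whole batches (not individual elements) across new partitions."""
    batches = [b for p in partitions for b in make_batches(p)]
    new_parts = [[] for _ in range(num_partitions)]
    for i, batch in enumerate(batches):
        new_parts[i % num_partitions].extend(batch)
    return new_parts

old = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]  # two partitions, 5 elements each
print(repartition_batches(old, 3))
```

Each source partition holds 5 elements, which is below the batch size of 10, so each becomes a single batch. Two batches spread over three target partitions necessarily leave one partition empty, which is exactly the shape of the output in the question (the position of the empty partition may differ, since Spark's placement isn't round-robin).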