Search code examples
javaapache-sparkcoalesce

SPARK: Understanding coalesce method?


i'm trying to understand the coalesce method in spark.

I have a JavaRDD<String> (which consist of 16310 Strings) and I want to save it in 233 files. (one file with 70 strings)

First, i tried it with trainDataFeatures.repartition(233).saveAsTextFile(outputPathTrainFeatures);

This works well, but i don't want to shuffle my data. so i tried it with: trainDataFeatures.coalesce(233, false).saveAsTextFile(outputPathTrainFeatures);

here i get only 4 output files. Not shuffled but only 4!!! It's really annoying. Maybe someone can help me with this issue.


Solution

  • I think that is the whole point and the biggest difference between coalesce and repartition.

    Repartition does full shuffle of the data to be able to create those extra partitions. Coalesce moves data between exisiting partitions and avoids creating new partitions and avoids a full data shuffle.

    Basically the fact that coalesce does not create the extra partitions for you is a feature of coalesce.

    Same with repartition - it's able to work in a performant way thanks to a full data shuffle. You may not care about performance and just want to increase number of partitions without a shuffle - well, someone had that idea before and this issue here is still open.