i'm trying to understand the coalesce method in spark.
I have a JavaRDD<String>
(which consist of 16310 Strings) and I want to save it in 233 files. (one file with 70 strings)
First, i tried it with trainDataFeatures.repartition(233).saveAsTextFile(outputPathTrainFeatures);
This works well, but i don't want to shuffle my data. so i tried it with: trainDataFeatures.coalesce(233, false).saveAsTextFile(outputPathTrainFeatures);
here i get only 4 output files. Not shuffled but only 4!!! It's really annoying. Maybe someone can help me with this issue.
I think that is the whole point and the biggest difference between coalesce
and repartition
.
Repartition does full shuffle of the data to be able to create those extra partitions. Coalesce moves data between exisiting partitions and avoids creating new partitions and avoids a full data shuffle.
Basically the fact that coalesce does not create the extra partitions for you is a feature of coalesce.
Same with repartition - it's able to work in a performant way thanks to a full data shuffle. You may not care about performance and just want to increase number of partitions without a shuffle - well, someone had that idea before and this issue here is still open.