Tags: scala, apache-spark, distributed-computing

Shuffled vs non-shuffled coalesce in Apache Spark


What is the difference between the following transformations when they are executed right before writing an RDD to a file?

  1. coalesce(1, shuffle = true)
  2. coalesce(1, shuffle = false)

Code example:

val input = sc.textFile(inputFile)
val filtered = input.filter(doSomeFiltering)
val mapped = filtered.map(doSomeMapping)

mapped.coalesce(1, shuffle = true).saveAsTextFile(outputFile)
vs
mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile)

And how does it compare with collect()? I'm fully aware that Spark's save methods store output in an HDFS-style directory structure; however, I'm more interested in the data-partitioning aspects of collect() versus shuffled/non-shuffled coalesce().


Solution

  • shuffle = true and shuffle = false make no practical difference to the resulting output, since both end up with a single partition. They do, however, change how the job executes. With shuffle = true, Spark inserts a shuffle stage boundary: the upstream transformations keep running in parallel across all of the original partitions, and only the final result is merged into one partition. A shuffle also distributes the output evenly amongst the target partitions, and it lets you increase the number of partitions, which plain coalesce cannot. With shuffle = false, Spark avoids the shuffle by collapsing the lineage, which for a target of 1 partition can pull the entire upstream computation into a single task. Either way, since your target is 1 partition, everything ends up in one partition; see the sketch below.
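
    A quick way to see this difference is to inspect the RDD lineage with toDebugString. This is a minimal sketch; the local master and the parallelized dummy data are stand-ins for the question's pipeline:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("coalesce-demo").setMaster("local[4]"))

    // Stand-in for the question's filtered/mapped pipeline, with 8 partitions.
    val mapped = sc.parallelize(1 to 1000, numSlices = 8).map(_.toString)

    // shuffle = false: a CoalescedRDD wraps its parents directly, so the whole
    // upstream pipeline is pulled into a single task.
    println(mapped.coalesce(1, shuffle = false).toDebugString)

    // shuffle = true: a shuffle stage boundary appears in the lineage, so the
    // upstream map still runs as 8 parallel tasks before the merge into 1 partition.
    println(mapped.coalesce(1, shuffle = true).toDebugString)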

    As for the comparison with collect(): with coalesce(1), all of the data ends up on a single executor, whereas collect() pulls all of the data back to the driver.
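
    To make that contrast concrete, here is a sketch reusing the mapped RDD from above and the question's outputFile:

    // collect() pulls every row across the network into driver memory as a
    // local Array, which can overwhelm the driver for large datasets.
    val onDriver: Array[String] = mapped.collect()

    // coalesce(1) + save merges the data into one partition on an executor
    // and writes from there; the driver only coordinates the job.
    mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile)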