Tags: scala, apache-spark, distributed-computing

Shuffled vs non-shuffled coalesce in Apache Spark


What is the difference between the following transformations when they are executed right before writing an RDD to a file?

  1. coalesce(1, shuffle = true)
  2. coalesce(1, shuffle = false)

Code example:

val input = sc.textFile(inputFile)
val filtered = input.filter(doSomeFiltering)
val mapped = filtered.map(doSomeMapping)

mapped.coalesce(1, shuffle = true).saveAsTextFile(outputFile)
vs
mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile)

And how does it compare with collect()? I'm fully aware that Spark's save methods store output in an HDFS-style directory structure; however, I'm more interested in the data-partitioning aspects of collect() versus shuffled/non-shuffled coalesce().


Solution

  • shuffle = true and shuffle = false make no practical difference to the resulting output, since both end up with a single partition. They do, however, change how the job executes. With shuffle = true, Spark inserts a shuffle stage boundary: the upstream transformations keep running in parallel across all of the original partitions, and only the final result is merged into one partition. A shuffle also distributes the output evenly amongst the target partitions, and it lets you increase the number of partitions, which plain coalesce cannot. With shuffle = false, Spark avoids the shuffle by collapsing the lineage, which for a target of 1 partition can pull the entire upstream computation into a single task. Either way, since your target is 1 partition, everything ends up in one partition; see the sketch below.
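
    A quick way to see this difference is to inspect the RDD lineage with toDebugString. This is a minimal sketch; the local master and the parallelized dummy data are stand-ins for the question's pipeline:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("coalesce-demo").setMaster("local[4]"))

    // Stand-in for the question's filtered/mapped pipeline, with 8 partitions.
    val mapped = sc.parallelize(1 to 1000, numSlices = 8).map(_.toString)

    // shuffle = false: a CoalescedRDD wraps its parents directly, so the whole
    // upstream pipeline is pulled into a single task.
    println(mapped.coalesce(1, shuffle = false).toDebugString)

    // shuffle = true: a shuffle stage boundary appears in the lineage, so the
    // upstream map still runs as 8 parallel tasks before the merge into 1 partition.
    println(mapped.coalesce(1, shuffle = true).toDebugString)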

    As for the comparison with collect(): with coalesce(1), all of the data ends up on a single executor, whereas collect() pulls all of the data back to the driver.
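
    To make that contrast concrete, here is a sketch reusing the mapped RDD from above and the question's outputFile:

    // collect() pulls every row across the network into driver memory as a
    // local Array, which can overwhelm the driver for large datasets.
    val onDriver: Array[String] = mapped.collect()

    // coalesce(1) + save merges the data into one partition on an executor
    // and writes from there; the driver only coordinates the job.
    mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile)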