Tags: apache-spark, pyspark, spark-shuffle

Understanding the shuffle in Spark


Shuffling in Spark works like this (as per my understanding):

  1. Identify the partition each record has to go to (hashing and modulo; see the sketch after this list)
  2. Serialize the data that needs to go to the same partition
  3. Transmit the data
  4. The data gets deserialized and read by the executors on the other end
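Step 1, as I understand it, could be sketched like this. This is not Spark's actual implementation (Spark's HashPartitioner uses the JVM key's hashCode, forced non-negative); `partition_for` and the sample keys are made up purely to illustrate the hashing-and-modulo idea:

```python
def partition_for(key, num_partitions):
    """Hypothetical stand-in for Spark's HashPartitioner logic:
    records with equal keys always map to the same shuffle partition."""
    # Python's % yields a non-negative result for a positive modulus,
    # so no extra sign fix-up is needed here (Spark's JVM code does one).
    return hash(key) % num_partitions

# With 4 reduce-side partitions, both "apple" records land together:
for key in ["apple", "banana", "apple", "cherry"]:
    print(key, "->", partition_for(key, 4))
```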

I have a question about this:

  1. How is the data transmitted between executors, even when there is plenty of space available in memory? Assume the execution memory is 50 GiB per executor and the entire dataset to be shuffled is just 100 MB. Is the data transmitted directly from storage memory (exec 1) to storage memory (exec 2), or are there disk writes involved as intermediate steps?

Solution

  • Spark shuffle outputs are always written to disk.

    Why? Simply because you cannot send data from one executor's memory to another executor's memory directly; it has to be written locally and then loaded into the receiving executor's memory. That's why you have serialization/deserialization during shuffling, and that's why having quality disks (SSDs) is also important for Spark (see the sketch at the end of this answer).

    from blog.scottlogic.com

    During a shuffle, data is written to disk and transferred across the network, halting Spark’s ability to do processing in-memory and causing a performance bottleneck.
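    To make this concrete, here is a minimal PySpark sketch that forces a shuffle on data that easily fits in memory; the app name, the bucket expression, and the /tmp/spark-shuffle-demo path are all just illustrative. While it runs, the map-side shuffle output appears as files under spark.local.dir, and the Spark UI reports a non-zero "Shuffle Write" even though everything would fit in RAM:

    ```python
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[4]")
        .appName("shuffle-demo")
        # Directory for shuffle output/spill files (path is illustrative).
        .config("spark.local.dir", "/tmp/spark-shuffle-demo")
        .getOrCreate()
    )

    df = spark.range(0, 1_000_000)
    # groupBy forces a shuffle: map tasks hash-partition their output,
    # write it to local disk, then reduce tasks fetch it over the network.
    counts = df.groupBy((df.id % 8).alias("bucket")).count()
    counts.show()

    spark.stop()
    ```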