Tags: apache-spark, network-programming, dependencies, rdd, partitioning

In Spark, why does a narrow dependency strictly not require a shuffle over the network?


I was reading about narrow vs. wide dependencies of an RDD partitioned across multiple partitions.

My question: I do not understand why RDDs built with narrow dependencies do not require a shuffle over the network. Or is it that a shuffle DOES happen, but only a small number of times?

Please refer to the diagram below (image of narrow vs. wide dependencies not shown):

Let's say a child RDD is created with a narrow dependency from a parent RDD, as marked in the red rectangle below. The parent RDD had 3 partitions, say (P1, P2, P3), and the data in each respective partition got mapped into 3 other partitions, say (P1, P4, P5) respectively.

Since the data in parent RDD partition P1 got mapped to itself, there is no shuffle over the network there. But the data from parent RDD partitions P2 & P3 got mapped to child RDD partitions P4 & P5, which are different partitions, so naturally the data has to pass through the network for the corresponding values to be placed in P4 & P5. Why, then, do we say that there is no shuffle over the network?

See the box in green; this is an even more complex case. The only case I can visualize where there is no shuffle over the network is when a parent RDD partition gets mapped to itself.

I am sure my reasoning is incorrect. Could someone provide some explanation? Thanks


Solution

  • From the link you provided:

    A typical execution sequence is as follows ... An RDD is created originally from external data sources (e.g. HDFS, a local file, etc.). The RDD undergoes a sequence of TRANSFORMATIONS (e.g. map, flatMap, filter, groupBy, join), each producing a different RDD that feeds into the next transformation. Finally, the last step is an ACTION (e.g. count, collect, save, take), which converts the last RDD into an output to external data sources. The above sequence of processing is called a lineage (the outcome of a topological sort of the DAG).
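That lineage can be sketched in plain Python, treating each inner list as one RDD partition. This is a simulation for intuition only, not the Spark API; note how each transformation operates on one partition at a time:

```python
# Simulate an RDD as a list of partitions; transformations apply per partition.
partitions = [[1, 2], [3, 4], [5, 6]]          # "RDD" loaded from an external source

# map: a TRANSFORMATION — each child partition comes from exactly one parent partition
mapped = [[x * 10 for x in p] for p in partitions]

# filter: another narrow TRANSFORMATION, again purely per partition
filtered = [[x for x in p if x > 20] for p in mapped]

# count: an ACTION that collapses the final RDD into a single result
count = sum(len(p) for p in filtered)
print(count)  # 4
```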

    Now think about how the data is processed as it makes its way through the pipeline.

    If there is a narrow dependency, then the child partition is only dependent on 1 parent partition. The parent partition's data can be processed on 1 machine and the child partition can then exist on the same machine. No shuffling of data is necessary.
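To make that concrete, here is a plain-Python sketch (the machine names are made up for illustration; this is not the Spark API). Each parent partition is processed in isolation, and the child partition simply stays on whatever machine the parent partition lives on:

```python
# Narrow dependency: each child partition depends on exactly one parent
# partition, so the work runs where the data already is — nothing moves.
def process_partition(parent_partition):
    # a narrow transformation such as map or filter
    return [x + 1 for x in parent_partition]

parent = {"machine-A": [1, 2], "machine-B": [3, 4], "machine-C": [5, 6]}
child = {machine: process_partition(p) for machine, p in parent.items()}
print(child)  # {'machine-A': [2, 3], 'machine-B': [4, 5], 'machine-C': [6, 7]}
```

This is why the partition numbering in the diagram (P2 becoming P4, etc.) is misleading: a new partition ID does not imply the data moved to a different machine.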

    If there is a wide dependency, then 1 child partition is dependent on many parent partitions. The parent partitions may exist on many machines, so the data must be shuffled across the network in order to complete the data processing.
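A wide dependency can be sketched the same way (again a simulation with made-up machine names, not the Spark API). For a key-based transformation like groupByKey, every parent partition may hold rows for any key, so rows must be routed by key to the machine that will own that key's child partition — that routing is the shuffle:

```python
from collections import defaultdict

# Wide dependency: each child partition needs records from EVERY parent
# partition, so rows must move (shuffle) to the partition that owns their key.
parent = {
    "machine-A": [("a", 1), ("b", 2)],
    "machine-B": [("a", 3), ("c", 4)],
    "machine-C": [("b", 5), ("a", 6)],
}

num_child_partitions = 2
shuffled = defaultdict(list)
for partition in parent.values():                  # every parent partition...
    for key, value in partition:
        target = hash(key) % num_child_partitions  # ...sends each row by key
        shuffled[target].append((key, value))
# All rows for a given key now sit in one child partition, but getting them
# there required data from all three parent partitions to cross the network.
```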