Search code examples
scalaapache-sparkpysparkrddtransformation

Why is union() a narrow transformation and intersection() is a wide transformation in spark?


I am trying to understand the underlying concept in Spark from here. As far as I have understood, narrow transformation produce child RDDs that are transformed from a single parent RDD (might be multiple partitions of the same RDD). However, union and intersection both require two or more RDDs for the transformations to be performed. Can someone please clear this theoretically?


Solution

  • No, your understanding is incorrect. A a narrow transformation is the one that only requires a single partition from the source to compute all elements of one partition of the output. union is therefore a narrow transformation, because to create an output partition, you only need the single partition from the source data.

    Intersection on the other hand is wide, because even for a single partition of the output, it requires access to the entire content of (at least) one of the source rdds.