apache-spark, rdd

What is the result of RDD transformation in Spark?


Can anyone explain what the result of an RDD transformation is? Is it a new set of data (a copy of the data), or is it only a new set of pointers to filtered blocks of the old data?


Solution

  • RDD transformations create dependencies between RDDs. The dependencies are only steps for producing results (a program). Each RDD in the lineage chain (the string of dependencies) has a function for computing its data and a pointer (dependency) to its parent RDD. Spark divides the RDD dependencies into stages and tasks and sends those to the workers for execution.

    So if you do this:

    val lines = sc.textFile("...")                      // transformation: nothing is read yet
    val words = lines.flatMap(line => line.split(" "))  // transformation: depends on lines
    val localwords = words.collect()                    // action: triggers execution
    

    words will be an RDD containing a reference to the lines RDD. When the program is executed, lines' function will run first (load the data from the text file), then words' function will run on the result (split each line into words). Spark is lazy, so nothing is executed until you call an action that triggers job creation and execution (collect in this example).

    So an RDD (including a transformed RDD) is not 'a set of data', but a step in a program (possibly the only step) telling Spark how to get the data and what to do with it.
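
    As a minimal sketch of how you can observe this yourself (assuming a spark-shell session where sc already exists and a hypothetical input.txt file): toDebugString prints the lineage, i.e. the dependency chain, without running anything, and only the collect action at the end triggers job creation and execution.

    // In spark-shell, sc is already defined; "input.txt" is a placeholder path for this sketch
    val lines = sc.textFile("input.txt")                // transformation: no data read yet
    val words = lines.flatMap(line => line.split(" "))  // transformation: records a dependency on lines

    // The lineage is available immediately, before any job has run;
    // it prints something like a chain of RDDs (flatMap -> textFile -> HadoopRDD)
    println(words.toDebugString)

    // Only this action triggers stage/task scheduling and the actual reading
    // and splitting of the file
    val localwords = words.collect()
    println(s"number of words: ${localwords.length}")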