Search code examples
apache-sparkdirected-acyclic-graphs

Can someone distinguish between RDD Lineage and a DAG (Direct Acyclic Graph)?


Can someone clarify what is the difference & similarities between RDD Lineage & DAG (Direct Acyclic graphs)?


Solution

  • DAG (direct acyclic graph) is the representation of the way Spark will execute your program - each vertex on that graph is a separate operation and edges represent dependencies of each operation. Your program (thus DAG that represents it) may operate on multiple entities (RDDs, Dataframes, etc). RDD Lineage is just a portion of a DAG (one or more operations) that lead to the creation of that particular RDD.

    So, one DAG (one Spark program) might create multiple RDDs, and each RDD will have its lineage (i.e that path in your DAG that lead to that RDD). If some partitions of your RDD got corrupted or lost, then Spark may rerun that part of the DAG that leads to the creation of those partitions.

    If the sole purpose of your Spark program is to create only one RDD and it's the last step, then the whole DAG is a lineage of that RDD.

    You can find out more here - https://data-flair.training/blogs/rdd-lineage/