Tags: apache-spark, rdd, directed-acyclic-graphs

How is fault tolerance achieved when there is no data replication in Spark?


I have been reading a lot about RDDs, but one thing I don't quite understand: how is an RDD fault tolerant when there is no replication in Apache Spark?

The post [1] says:

To achieve Spark fault tolerance for all the RDDs, the entire data is copied across multiple nodes in the cluster.

As per my understanding, if this is the case then there must be data replication, but most articles say the DAG is how Spark achieves fault tolerance.

Can someone explain this in a bit more detail?

[1]: https://hevodata.com/learn/spark-fault-tolerance/


Solution

  • Your assumption about replication is both wrong and correct, depending on your perspective.

    Spark itself replicates nothing by default, as it prefers to work in memory. If data is persisted to HDFS or S3, those storage systems handle the replication (see the first sketch below).

    There seems to be a lot of copying and pasting on the Internet where Spark fault tolerance is concerned, so the same 'misinformation' about replication keeps being propagated.

    RDD lineage and checkpointing restore data that needs to be re-computed, either from the start by replaying the recorded chain of transformations (the DAG) or from a checkpoint location on disk (see the second sketch below).
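    Here is a minimal sketch (assuming a spark-shell session, so sc, the SparkContext, is already defined) showing that cached RDD data is not replicated unless you explicitly opt in via the "_2" storage levels, and that durable replication is delegated to the underlying store:

        import org.apache.spark.storage.StorageLevel

        val rdd = sc.parallelize(1 to 1000000).map(_ * 2)

        // Default caching keeps one in-memory copy per partition -- no
        // replication. If an executor dies, its lost partitions are simply
        // re-computed from lineage.
        rdd.persist(StorageLevel.MEMORY_ONLY)

        // Replication is strictly opt-in: the "_2" storage levels keep a
        // second copy of each cached block on another node.
        // rdd.persist(StorageLevel.MEMORY_ONLY_2)

        // Writing out delegates durability/replication to HDFS or S3.
        // rdd.saveAsTextFile("hdfs:///tmp/demo-output")  // hypothetical path

        rdd.count()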
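    And a second sketch of the lineage/checkpoint mechanism itself (same spark-shell assumption; the checkpoint directory is a hypothetical local path, on a real cluster it should point at HDFS or similar reliable storage):

        // Checkpoints should live on reliable storage in production.
        sc.setCheckpointDir("/tmp/spark-checkpoints")

        val base    = sc.parallelize(1 to 100)
        val derived = base.map(_ + 1).filter(_ % 2 == 0)

        // The lineage (DAG) Spark would replay to rebuild a lost partition:
        println(derived.toDebugString)

        derived.checkpoint()   // mark for checkpointing; truncates lineage
        derived.count()        // an action triggers the checkpoint write

        // Recovery now starts from the checkpoint files on disk instead of
        // re-running the whole chain of transformations from the source.
        println(derived.toDebugString)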