I'm interested in finding out how Spark implements fault tolerance. In their paper they describe how they do it for "narrow dependencies" like map which is fairly straight forward. However, I they do not state what they do if a node crashes after a wide dependency like a sort operation. The only thing I could find is this:
In contrast, in a lineage graph with wide dependencies, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution.
Which is not really enough for understanding what's happening.
After a sort, there is no way of telling where the data that was stored on the crashed node came from without storing some additional information. So if a crash happens after a sort, is the entire lineage re-executed or is there some mechanism reducing the computational overhead? And what about other wide dependencies?
RDD dependencies are actually in terms of partitions and how they are created from partitions of other RDDs.
A wide dependency means the data required to create a partition is obtained from more than one partitions(from same or different RDDs). Each partition is assigned an executor.
Now assume, we are joining two RDDs R1 and R2 that have n and m partitions respectively. Also, for the sake of simplicity, let's assume that both R1 and R2 have been calculated by (n x m) different executors. We are going to create a third RDD R3 by joining R1 and R2.
When R3 is being calculated, assume a node containing x executors (out of (n x m) executors) failed for some reason. It doesn't affect the remaining executors and their data on the other nodes.
Only those partitions in R3 which were supposed to be created from those failed x executor's data are affected. And only those x partitions are recreated.
A more detailed visual explanation is available here
Updated: About Spark caching Below URLs should help you understand the whole persistence feature of Spark