apache-spark, non-deterministic

Sources of non-determinism in Apache Spark


I am trying to figure out all sources of non-determinism in Spark. I understand that non-determinism can come from user-provided functions, e.g. in a map(f) where f involves random numbers. I am instead looking for operations that can lead to non-determinism on their own, either at the level of transformations/actions or at a lower level, e.g. shuffling.


Solution

  • Off the top of my head:

    • operations that require shuffling (or network traffic in general) may output values in a non-deterministic order. This includes obvious cases such as groupBy* or join. A less obvious example is the order of ties after sorting: rows with equal sort keys can come out in a different relative order from run to run

    • operations that depend on changing data sources or on mutable global state

    • side effects executed inside transformations, including accumulator updates, since a transformation may run more than once if a task or stage is re-executed
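
The point about ties after sorting can be illustrated with a small sketch. This is a pure-Python simulation, not Spark code: the names `partitions` and `sort_after_shuffle` are hypothetical, and enumerating arrival orders with `itertools.permutations` is only a stand-in for the varying order in which shuffled data may arrive. It assumes the sort compares the key alone and keeps tied rows in arrival order.

```python
from itertools import permutations

# Three hypothetical partitions of (key, value) rows; key 1 appears in all of them.
partitions = [
    [(1, "a"), (2, "b")],
    [(1, "c"), (3, "d")],
    [(1, "e"), (2, "f")],
]

def sort_after_shuffle(arrival_order):
    """Merge partitions in the given arrival order, then sort by key only."""
    merged = [row for part in arrival_order for row in part]
    # sorted() is stable, so rows with equal keys keep their arrival order.
    return sorted(merged, key=lambda row: row[0])

# Try every possible arrival order of the three partitions.
results = [sort_after_shuffle(order) for order in permutations(partitions)]

# The sequence of keys is always the same ...
key_sequences = {tuple(key for key, _ in rows) for rows in results}
# ... but the full rows differ, because ties are ordered by arrival.
row_sequences = {tuple(rows) for rows in results}

# Sorting on the whole row (key plus a tie-breaker) removes the ambiguity.
tie_broken = {tuple(sorted(rows)) for rows in results}

print(len(key_sequences), len(row_sequences), len(tie_broken))
```

The output is correctly sorted by key in every case, yet the rows themselves appear in several different orders. In Spark, the analogous mitigation would presumably be to add a unique column as a secondary sort key, so that no two rows compare as equal.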