The foreach action on a partitioned RDD yields unpredictable results. Why does this happen?
For instance, I attempted to print doubled values using the foreach action on an RDD partitioned into two slices (numSlices=2). Here is the code:
numbers = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], numSlices=2)
print(numbers.collect())

def f(x):
    print(x, x*2)

numbers.foreach(lambda x: f(x))
However, the output is inconsistent. Sometimes it correctly prints each original value x alongside 2x, but when I rerun the code the output differs, as shown below. Should I avoid using foreach when working with partitioned RDDs?
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
15 102
62 124
73 146
84 168
9 18
Apache Spark is a parallel computing engine. It does not process partitions sequentially but in parallel.
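You can inspect how parallelize split the data with glom(), which turns each partition into a list of its elements (a quick sketch, reusing the sc and numbers defined in the question):

print(numbers.glom().collect())
# e.g. [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]] -- two partitions, processed by two tasks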
print(x, x*2) basically performs the following actions sequentially: print(x), print(' '), print(x*2), print('\n'). Now, let's analyze the first line of your output. Suppose that 1 and 5 sit in distinct partitions. If they are processed in parallel by two separate executors, the following sequence of actions could happen:
print(1) # executor_1
print(5) # executor_2
print(' ') # executor_1
print(' ') # executor_2
print(5*2) # executor_2
print(1*2) # executor_1
print('\n') # executor_2
print('\n') # executor_1
For each executor taken on its own, the sequence of actions is consistent, but because the two sequences are interleaved, you get the strange output you witnessed:
15 102
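You can verify this write-per-token behaviour in plain CPython, no Spark needed. Here is a minimal sketch using a hypothetical RecordingWriter class that logs every individual write() call that print issues:

class RecordingWriter:
    # Hypothetical helper: records each individual write() call.
    def __init__(self):
        self.calls = []
    def write(self, s):
        self.calls.append(s)
        return len(s)

w = RecordingWriter()
print(1, 2, file=w)
print(w.calls)  # ['1', ' ', '2', '\n'] -- four separate writes

Since each of those four writes can be interleaved with writes coming from another executor process, the tokens of two output lines can end up mixed together.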
Notice that if you force Spark to process everything in a single partition (and therefore sequentially), the output becomes orderly:
>>> numbers.coalesce(1).foreach(lambda x: f(x))
1 2
2 4
3 6
4 8
5 10
6 12
7 14
8 16
9 18
10 20
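More generally, foreach runs on the executors, so on a real cluster these prints land in the executor logs rather than on your driver console; seeing them at all here is a side effect of running in local mode. If you just want to print values in a deterministic order on the driver, one common pattern (a sketch, assuming the RDD is small enough to fit in driver memory) is to collect first:

for x in numbers.collect():
    print(x, x*2)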