Tags: apache-spark, reduce, fold

Spark: Difference Between Reduce() vs Fold()


I'm studying Spark using the book Learning Spark: Lightning-Fast Big Data Analysis.

I have visited many sites and read many articles, but I still don't understand the difference between reduce() and fold().

According to the book that I'm using:

"Similar to reduce() is fold(), which also takes a function with the same signature as needed for reduce(), but in addition takes a “zero value” to be used for the initial call on each partition. The zero value you provide should be the identity element for your operation; that is, applying it multiple times with your function should not change the value (e.g., 0 for +, 1 for *, or an empty list for concatenation)."

To help me better understand, I ran the following code:

rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)

rdd.getNumPartitions()
Out[1]: 2

rdd.glom().collect()
Out[2]: [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

rdd.reduce(lambda x,y: x+y)
Out[3]: 55

rdd.fold(0, lambda x,y: x+y)
Out[4]: 55

Question: 1) Referencing: "but in addition takes a “zero value” to be used for the initial call on each partition." What does "initial call on each partition" mean?

2) Referencing: "The zero value you provide should be the identity element for your operation; that is, applying it multiple times with your function should not change the value". If the zero value can't change the result, what is the point of providing it at all?

3) According to the example I provided above, both produced the sum of 55. What's the difference?


Solution

  • The difference is that fold() lets you change the type of the result, whereas reduce() can't: every intermediate value in reduce() comes from the data itself, so the result must have the same type as the elements. e.g.

    rdd.fold("", lambda x, y: x + str(y))
    '12345678910'
    

    Your example doesn't change the type of the result, so in that case reduce() and fold() are indeed interchangeable.
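    To see why reduce() needs the result type to match the element type, here is a minimal local sketch (plain Python's functools.reduce, not Spark, but the accumulation logic is the same): without an initial value, the first accumulator is the first data element, an int, and string concatenation fails.

    ```python
    from functools import reduce

    # Same lambda as above. The very first call is op(1, 2) = 1 + str(2),
    # i.e. int + str, which raises a TypeError: there is no zero value
    # to start the accumulator off as a string.
    try:
        reduce(lambda x, y: x + str(y), [1, 2, 3, 4, 5])
    except TypeError as e:
        print("reduce failed:", e)

    # With an initial value (what fold's zeroValue provides), the
    # accumulator is a string from the start, so it works:
    result = reduce(lambda x, y: x + str(y), [1, 2, 3, 4, 5], "")
    print(result)  # '12345'
    ```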

    A "normal" fold, as used in a non-distributed environment, applies the initial value exactly once. Spark, however, runs distributed: it starts the fold with the zero value in each partition, and then uses it once more when combining the per-partition results. Because your example puts the 10 numbers into 2 partitions, if we called the following:

    rdd.fold("HERE", lambda x, y: x + str(y))
    

    we'd get

    'HEREHERE12345HERE678910'
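
    The extra "HERE"s can be reproduced locally. Here is a hedged sketch (plain Python standing in for Spark; `spark_like_fold` is a made-up helper, not a Spark API) that models fold the way Spark applies it: once per partition, then once more when merging the partition results on the driver.

    ```python
    from functools import reduce

    def spark_like_fold(partitions, zero, op):
        # Step 1: fold each partition, seeding each with the zero value.
        per_partition = [reduce(op, part, zero) for part in partitions]
        # Step 2: fold the partition results, seeded with zero once more.
        return reduce(op, per_partition, zero)

    # The asker's 2-partition layout, as shown by rdd.glom().collect()
    partitions = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

    op = lambda x, y: x + str(y)
    print(spark_like_fold(partitions, "HERE", op))
    # 'HEREHERE12345HERE678910' -- zero used 2 (partitions) + 1 (merge) times

    # With a true identity element (0 for +), the extra uses are harmless:
    print(spark_like_fold(partitions, 0, lambda x, y: x + y))  # 55
    ```

    This is also the answer to question 2): the zero value must be an identity precisely because Spark applies it numPartitions + 1 times, and that count depends on how the data happens to be partitioned.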