Tags: python, python-3.x, apache-spark, pyspark, rdd

What is the meaning of neutral zero value in the fold function of pyspark?


Here is the code snippet:

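from operator import add

# sc is assumed to be an existing SparkContext (e.g. the one the pyspark shell provides)
iris1 = sc.textFile("./dataset/iris_site.csv")
iris1_split = iris1.map(lambda var1: var1.split(","))
# Sum the first column, starting from a zero value of 0
iris1_split.map(lambda col: float(col[0])).fold(0, add)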

Here is what I understand about the fold function:

  1. It's used for aggregation.

  2. add is the addition operator, used here to sum the measurement values in the first column (col[0]).

  3. The first argument is called the neutral zero value, as per this post. (But I don't know what that actually means.)

  4. I tried changing the zero value to 1, 2, -2 and 10, and the result changed by 2, 4, -4 and 20 respectively.

    By observing the pattern of increment/decrements,
    The equation seems like result = 2*neutral_zero_value + aggregation_result

A similar zeroValue argument also appears in the foldByKey function.

Click here to get iris Dataset


Solution

  • The neutral zero value is the identity element of the operation: combining it with any value leaves that value unchanged. In the example above the operation is addition, so the identity element has to be 0. If it were multiplication, the identity element would have to be 1.
    Why does fold() take a neutral zero value at all? There is a similar function, reduce(). Given an empty collection, reduce() throws an exception, whereas fold() is still defined for an empty collection thanks to the neutral zero value.

    Analogy
    Imagine it as a variable sum which is initialized as 0 for doing the addition operation.

    sum_ = 0 # here 0 is an identity element for addition
    collection = [1,2,4,5]
    for elem in collection:
        sum_ += elem
    

    Even if you pass an empty list, sum_ is still defined (it stays 0).
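
The empty-collection case can be checked in plain Python. This is only a sketch of the analogy, not Spark code: the `fold` helper below is a hypothetical stand-in built on `functools.reduce` with an explicit start value, and it is compared against `reduce` called without one.

```python
from functools import reduce
from operator import add


def fold(collection, zero_value, op):
    """Hypothetical plain-Python stand-in for a fold: reduce with a start value."""
    return reduce(op, collection, zero_value)


empty = []

# With a zero value, an empty collection still has a well-defined result.
fold_result = fold(empty, 0, add)

# Without a start value, reduce has nothing to return for an empty collection.
try:
    reduce(add, empty)
    reduce_failed = False
except TypeError:
    reduce_failed = True

# On a non-empty collection the identity element does not change the total.
total = fold([1, 2, 4, 5], 0, add)
```

Here `fold_result` is the zero value itself, `reduce_failed` ends up `True`, and `total` is the ordinary sum of the list.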

    Similarly, for multiplication

    prod = 1 # here 1 is an identity element for multiplication
    collection = [1,2,4,5]
    for elem in collection:
        prod *= elem
    

    For more details, see this article, in particular the parts about the reduce and fold functions.
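
The pattern observed in the question (the result shifting by twice the zero value) can be reproduced with a small plain-Python simulation. To my understanding, an RDD fold starts from the zero value once inside every partition and once more when the per-partition results are merged, so a non-identity zero value is counted (number of partitions + 1) times. The function `simulated_fold` and the partition layouts below are assumptions made for illustration; this is not PySpark's own code.

```python
from functools import reduce
from operator import add


def simulated_fold(partitions, zero_value, op):
    """Illustrative model of a partitioned fold (hypothetical helper)."""
    # Each partition starts its own running result from zero_value ...
    partials = [reduce(op, part, zero_value) for part in partitions]
    # ... and the partial results are merged starting from zero_value again.
    return reduce(op, partials, zero_value)


# Same sample numbers as in the analogy, laid out in assumed partitions.
one_partition = [[1.0, 2.0, 4.0, 5.0]]
two_partitions = [[1.0, 2.0], [4.0, 5.0]]

identity_result = simulated_fold(one_partition, 0, add)   # 12.0
shifted_one = simulated_fold(one_partition, 10, add)      # 12 + 2*10 = 32.0
shifted_two = simulated_fold(two_partitions, 10, add)     # 12 + 3*10 = 42.0
```

With a single partition the shift is exactly 2 * zero value, which would be consistent with the numbers reported in the question if the file was read into one partition. With the identity element 0 the result does not depend on the number of partitions at all, which is why the zero value should be neutral for the operation.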