
Pyspark: Standard deviation using reduce throws overflow error


I'm trying to calculate the standard deviation of a dataset without using rdd.stdev(). I've tried two methods; the one that uses rdd.reduce() fails with OverflowError: (34, 'Numerical result out of range').

If I do it this way, everything seems to work:

data = sc.textFile("data.csv") # Space-separated values
rdd  = data.map(lambda x: float(x.split(" ")[3])) # Only the fourth column
mean = rdd.mean() # High number: 1410000

rdd_list = rdd.collect()
part_sum = 0
for i in rdd_list:
    part = (i - mean) ** 2
    part_sum += part

variance = part_sum / rdd.count() # 5.8 trillion
std_dev  = variance ** 0.5 # 2425645. The same if I do rdd.stdev()

However, this way raises an OverflowError:

data = sc.textFile("data.csv") # Same file
rdd  = data.map(lambda x: float(x.split(" ")[3])) # 4th column
mean = rdd.mean() # Same mean

def partial(x, y):
    a = (x - mean) ** 2
    b = (y - mean) ** 2
    return a + b

part_sum = rdd.reduce(partial) # This throws the error: 
File "<stdin>", line 2, in partial
OverflowError: (34, 'Numerical result out of range')

What is going on? I know rdd.stdev() works and gives me the correct result; I'm just trying to compute it without using it, as a practice exercise, but I don't understand what's happening.


Solution

  • The implementation of partial is not correct: reduce passes the result of the previous step back in as the first argument, so the running sum is treated as another data value and most values end up in the sum more than once.

    Let's assume the list of values consists of 1, 5 and 6. The mean would be 4 and the variance according to the first code block would be 4.67 ((9+1+4)/3).

    Using the reduce mechanism, the following happens:

    • Step 1: partial(1, 5) returns (1 - 4)² + (5 - 4)² = 9 + 1 = 10
    • Step 2: partial(10, 6) returns (10 - 4)² + (6 - 4)² = 36 + 4 = 40

    After dividing the result by the count, the variance comes out as 13.33 instead of 4.67. The problem occurs in step 2, when the result of step 1 is fed back into the reduce function as if it were a data value: its deviation from the mean is squared again and added to the sum. With large inputs, each step effectively squares the running total, so it quickly exceeds the range of a float and the ** 2 raises OverflowError (the plain-Python sketch at the end of this answer demonstrates this).

    Instead of reduce, aggregate should work, since it keeps the accumulator separate from the data values:

    rdd.aggregate(0, lambda a, b: a + (b - mean) ** 2, lambda a, b: a + b)
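
    Putting it together, a sketch using the rdd and mean from the question (aggregate takes a zero value, a function that folds one element into a partition's accumulator, and a function that merges two accumulators):

    part_sum = rdd.aggregate(
        0.0,
        lambda acc, x: acc + (x - mean) ** 2,  # fold one squared deviation in
        lambda a, b: a + b                     # merge per-partition sums
    )
    variance = part_sum / rdd.count()  # population variance
    std_dev  = variance ** 0.5         # should match rdd.stdev()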
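
    Alternatively, reduce itself is fine as long as the squaring is moved into a map step, so that reduce only ever adds values that are already squared deviations (again a sketch with the same rdd and mean):

    part_sum = rdd.map(lambda x: (x - mean) ** 2).reduce(lambda a, b: a + b)
    variance = part_sum / rdd.count()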
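
    To see the blow-up from partial in isolation, here is a plain-Python sketch; the starting value is made up, but of the same magnitude as the squared deviations in the question:

    mean = 1410000.0
    acc  = (3835645.0 - mean) ** 2   # one squared deviation, roughly 5.9e12
    try:
        for step in range(10):
            acc = (acc - mean) ** 2  # each reduce step squares the running total again
            print(step, acc)         # ~3.5e25, ~1.2e51, ... past float range in a few steps
    except OverflowError as e:
        print(e)                     # (34, 'Numerical result out of range')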