python pearson-correlation

Why is the built-in Python sum function behaving like this?


I am trying to write a program that computes the Pearson correlation coefficient, using the population standard deviation, in Python. I thought this would be pretty trivial until I got to the part where I sum (yi - μy)*(xi - μx). Here is my full code:

def r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sdx = (sum([(xi-mx)**2 for xi in x]) / len(x))**0.5
    sdy = (sum([(yi-my)**2 for yi in y]) / len(y))**0.5
    res = ((sum([(xi-mx)*(yi-my) for xi in x for yi in y]))/(len(x)*sdx*sdy))**0.5
    return res

I noticed the result was super small, so I checked out the sum of (xi-mx):

sum([(xi-mx) for xi in x])

and the result was -9.769962616701378e-15. Here are the values in the list:

print([(xi-mx) for xi in x])
[3.2699999999999987, 3.0699999999999994, 1.2699999999999987, 1.0699999999999985, 0.9699999999999989, 0.2699999999999987, -0.7300000000000013, -1.7300000000000013, -2.7300000000000013, -4.730000000000001]

Can anyone explain why Python is behaving so strangely with this?


Solution

  • res = (sum([(xi-mx)*(yi-my) for xi in x for yi in y]))/(len(x)*sdx*sdy)
    

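    That isn't doing what you think it does. The numerator of Pearson's correlation coefficient sums (xi - mx) * (yi - my) over pairs taken at the same index, but two for clauses in one comprehension iterate over every combination of xi and yi. Using zip should fix it. (The outer **0.5 from the question is also dropped here, since the coefficient is the covariance divided by the two standard deviations, with no square root.)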

    res = (sum([(xi-mx)*(yi-my) for xi, yi in zip(x, y)]))/(len(x)*sdx*sdy)
    

    This is what I'm getting:

    def r(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        sdx = (sum([(xi-mx)**2 for xi in x]) / len(x))**0.5
        sdy = (sum([(yi-my)**2 for yi in y]) / len(y))**0.5
        res = (sum([(xi-mx)*(yi-my) for xi, yi in zip(x, y)]))/(len(x)*sdx*sdy)
        return res
    
    r(x, y) # 0.6124721937208479
    

    What does for xi in x for yi in y really do?

    >>> x, y = [1, 2, 3], [4, 5, 6]
    >>> [(xi, yi) for xi in x for yi in y]
    [(1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6)]
    

    So every xi gets combined with every yi: the nested loops produce the Cartesian product, len(x) * len(y) tuples instead of len(x) pairs. zip pairs the values by position instead:

    >>> [*zip(x, y)]
    [(1, 4), (2, 5), (3, 6)]
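
The difference between the two comprehensions also explains why the question's result was so small. Summing (xi-mx)*(yi-my) over every combination factors into sum(xi-mx) * sum(yi-my), and each of those sums is mathematically zero. A value like -9.769962616701378e-15 is just floating-point rounding error around that zero, not sum misbehaving. Below is a minimal sketch of this; the data and the names pearson_r, dx, dy, nested and paired are made up for illustration and are not from the question.

```python
import math


def pearson_r(x, y):
    """Population Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sdx = (sum((xi - mx) ** 2 for xi in x) / n) ** 0.5
    sdy = (sum((yi - my) ** 2 for yi in y) / n) ** 0.5
    # zip pairs each xi with the yi at the same index.
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    return cov / (sdx * sdy)


# Hypothetical sample data (not the asker's lists).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

mx, my = sum(x) / len(x), sum(y) / len(y)
dx = [xi - mx for xi in x]  # deviations from the mean of x
dy = [yi - my for yi in y]  # deviations from the mean of y

# Every combination: equals sum(dx) * sum(dy), which is zero up to rounding.
nested = sum(a * b for a in dx for b in dy)
product_of_sums = sum(dx) * sum(dy)

# Position-wise pairs: the numerator Pearson's r actually needs.
paired = sum(a * b for a, b in zip(dx, dy))

print(nested, product_of_sums)  # both tiny, possibly not exactly 0.0
print(paired)
print(pearson_r(x, y))
```

If the tiny nonzero sums are a concern in their own right, math.fsum from the standard library adds floats with less accumulated rounding error than the built-in sum, although the deviations themselves are still rounded values.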