Search code examples
pythonpython-3.xnumpystatisticsvariance

Why is NumPy's variance result different from mine?


I'm having a lot of trouble understanding how variance works and would be grateful if someone would please explain where I'm going wrong.

First of all, the two NumPy methods below provide the same answer for the variance of a particular array (people who are more experienced with Python than I am have told me there's no reason to be suspicious of the results!).

  1. np.var(myArray)
  2. np.mean(abs(myArray - np.mean(myArray))**2)

The problem is that I get a different result when using a third method that someone else uses and don't understand why. I'm worried that there is something I don't understand about what Python is doing. Can anyone please help me figure this out?

  1. First term - second term (where an element in myArray is the square root of xComponent + yComponent + zComponent dotted with themselves).

First term = (np.dot(xComponent, xComponent) + np.dot(yComponent, yComponent) + np.dot(zComponent, zComponent))/len(zComponent)

I get the same answer when I don’t use the individual components but use the total value; i.e., np.mean(myArray**2).

Second term=np.mean(xComponent)**2 + np.mean(yComponent)**2 + np.mean(zComponent)**2

Subtracting the second term from the first term leads to a very different result than what I get from NumPy by using methods 1 and 2 above. One thing that this method does do that method 2 doesn't is provide a tiny number for the second term. I have read that the second term should be tiny, although I don't understand why.


Edit: Here is an example array. It is much shorter than the usual datasets I use. I can't currently test it but will do so as soon as possible to check it represents my problem like the full dataset does.

myArray=np.array([33.4479672],
                 [36.1206867],
                 [33.84485692],
                 [27.28590267],
                 [21.85568418],
                 [17.01874484],
                 [25.50861718],
                 [29.40798574],
                 [36.71092762],
                 [45.72983789],
                 [40.47352496]])

Here are the corresponding x, y, z components:

23.7427145, -7.72698565, 22.25631845
25.37794739, -9.0226496, 24.06772919
22.1871844, 2.31027064, 25.4532088
19.29475621, 8.9243651, 17.1052207
9.18913589, 18.3261273, -7.57520763
10.00418173, 5.86260433, -12.45728278
-4.0904234, 15.13497563, 20.12189104
-12.83798541, -16.57398946, 20.62325458
-5.6879695, -21.33899754, 29.32552461
19.06079677, 28.16146311, 30.57508946
25.88007, 27.25161939, 15.02256438

Solution

  • The variance of a random variable X with mean mu (mu = E[X]) can be expressed, equivalently, as (here, outside of code blocks, ^ denotes "to the power of" as is commonly used in math):

    1. Var(X) = E[(X-mu)^2]
    2. Var(X) = E[X^2] - (E[X])^2 = E[X^2] - (mu)^2

    Your second method uses the first form of the definition. It looks like your third method attempts to use the second form, but does not do it correctly.

    Your first term, (np.dot(xComponent, xComponent) + np.dot(yComponent, yComponent) + np.dot(zComponent, zComponent))/len(zComponent) is equal to np.mean(myArray**2) since they calculate the same thing. For an element a in the array with components x, y, and z, you have, by definition a^2 = x+2 + y^2 + z^2. In your first term, np.dot of each component computes, elementwise, the sums of squares of that particular component, which you then add up for all three components. Dividing by the length of the array then gives you the mean of the squares.

    So, your first term represents E[X^2].

    To compute the variance, you need to subtract E[X]^2. So, your second term should be np.mean(myArray)**2. I do not think there is any easy way to represent it using components. If your array is X=[a1,a2,...,an] with components [x1,y1,z1], [x2,y2,z2], ..., [xn,yn,zn], then E[X]^2 = ((sqrt(x1^2+y1^2+z1^2)+sqrt(x2^2+y2^2+z2^2)+...+sqrt(xn^2+yn^2+zn^2))/n)^2, and you cannot neatly separate the three components out. It may have been that the values in the code you saw had some specific property that made the results equivalent.

    So, finally, for your example:

    >>> myArray
    array([[33.4479672 ],
           [36.1206867 ],
           [33.84485692],
           [27.28590267],
           [21.85568418],
           [17.01874484],
           [25.50861718],
           [29.40798574],
           [36.71092762],
           [45.72983789],
           [40.47352496]])
    >>> xComponent
    array([ 23.7427145 ,  25.37794739,  22.1871844 ,  19.29475621,
             9.18913589,  10.00418173,  -4.0904234 , -12.83798541,
            -5.6879695 ,  19.06079677,  25.88007   ])
    >>> yComponent
    array([ -7.72698565,  -9.0226496 ,   2.31027064,   8.9243651 ,
            18.3261273 ,   5.86260433,  15.13497563, -16.57398946,
           -21.33899754,  28.16146311,  27.25161939])
    >>> zComponent
    array([ 22.25631845,  24.06772919,  25.4532088 ,  17.1052207 ,
            -7.57520763, -12.45728278,  20.12189104,  20.62325458,
            29.32552461,  30.57508946,  15.02256438])
    >>> np.var(myArray)
    63.77153203225587
    >>> np.mean(abs(myArray - np.mean(myArray))**2)
    63.77153203225587
    >>> (np.dot(xComponent, xComponent) + np.dot(yComponent, yComponent) + np.dot(zComponent, zComponent))/len(zComponent) -
      np.mean(myArray)**2
    63.77153212702058