
NumPy: how to calculate variance along each row of a 2D array using np.var and by hand (i.e., not using np.var; calculating each term explicitly)?


I am using Python to import data from large files. There are three columns corresponding to x, y, z data. Each row represents a time at which the data were collected. For example:

importedData = [[1, 2, 3],   # <-- this row: x, y, and z data at time 0
                [4, 5, 6],
                [7, 8, 9]]
  1. I want to calculate the variance for each time (row). As far as I know, one way to do this is as follows (if this is not correct, I would appreciate a heads-up); the output for the example data is shown just after this list:

    varPerTimestep = np.var(importedData, axis=1)

  2. Here's my problem. To convince a coworker it works, I would next like to do the same thing, but avoid using np.var. This means solving:

    Var(S) = ⟨S_bar⋅S_bar⟩ − ⟨S_bar⟩⟨S_bar⟩    # S_bar = (x, y, z), the values in one row; ⟨⟩ denotes the mean
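
For reference, running step 1 on the example data gives the values below (just a quick sanity check in the interpreter):

    import numpy as np

    importedData = [[1, 2, 3],   # x, y, z at time 0
                    [4, 5, 6],
                    [7, 8, 9]]

    print(np.var(importedData, axis=1))
    # [0.66666667 0.66666667 0.66666667]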

I'm an intermittent Python user and just can't figure out how to do this for each row. I found a suggestion online, but I don't know how to adapt the code below so that it applies to each row. (Apologies, I can't provide the link: when I include it, I get an error that my code is not formatted correctly and I can't post the question, which is also why some of the code below was formatted as quotes.)

def variance(data, ddof=0):
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)
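
Called on a single row it gives what I expect:

    print(variance([1, 2, 3]))   # 0.6666666666666666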

I have tried various things. For example, putting the function in a loop where I first attempted just getting a row average:

for row in importedData:
    mean_test = np.mean(importedData,axis=1)
print(mean_test)

This gives me an error I can't figure out:

Traceback (most recent call last):
  File "<string>", line 13, in <module>
TypeError: list indices must be integers or slices, not tuple

I also tried this and get no output because I seem to be stuck in a loop:

    n = len(importedData[0,:])          # Trying to get the length of each row.
    mean = mean(importedData[0,:])      # Likewise trying to get the mean of each row.
    deviations = [(x - mean) ** 2 for x in importedData]
    variance = sum(deviations) / n

If anyone could please point me in the right direction, I would be grateful.


Solution

  • Well you could do something like this to make things more explicit:

    import numpy as np 
    
    importedData = np.arange(1,10).reshape(3,3)
    
    # Get means for each row
    means = [row.mean() for row in importedData]
    
    # Calculate squared errors
    squared_errors = [(row-mean)**2 for row, mean in zip(importedData, means)]
    
    # Calculate "mean for each row of squared errors" (aka the variance)
    variances = [row.mean() for row in squared_errors]
    
    # Sanity check
    print(variances)
    print(importedData.var(1))
    
    # [0.6666666666666666, 0.6666666666666666, 0.6666666666666666]
    # [0.66666667 0.66666667 0.66666667]
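
  • If you want to match the ⟨S_bar⋅S_bar⟩ − ⟨S_bar⟩⟨S_bar⟩ form from the question exactly (mean of the squared values minus the square of the mean), here is a pure-Python sketch of that same idea; the helper name row_variance is just for illustration:

    import numpy as np

    importedData = np.arange(1, 10).reshape(3, 3)

    def row_variance(row):
        # <S.S> - <S><S>: mean of the squared values minus the square of the mean
        n = len(row)
        mean_of_squares = sum(float(x) * float(x) for x in row) / n
        mean = sum(float(x) for x in row) / n
        return mean_of_squares - mean * mean

    variances = [row_variance(row) for row in importedData]

    # Both lines should print ~0.6667 for every row
    print(variances)
    print(importedData.var(1))

The fully vectorized equivalent of the same formula would be (importedData**2).mean(axis=1) - importedData.mean(axis=1)**2, although that of course hands the work back to NumPy.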