I am using Python to import data from large files. There are three columns corresponding to x, y, z data. Each row represents a time at which the data were collected. For example:
    importedData = [[1, 2, 3],  # This row: x, y, and z data at time 0.
                    [4, 5, 6],
                    [7, 8, 9]]
I want to calculate the variance for each time (row). As far as I know, one way to do this is as follows (if this is not correct, I would appreciate a heads-up):
    varPerTimestep = np.var(importedData, axis=1)
Here's my problem. To convince a coworker it works, I would next like to do the same thing, but avoid using np.var. This means solving:
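$$\mathrm{Var}(S) = \langle \bar{S} \cdot \bar{S} \rangle - \langle \bar{S} \rangle \langle \bar{S} \rangle$$
where $\bar{S}$ stands for the x, y, z data.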
I'm an intermittent Python user and can't figure out how to do this for each row. I found a suggestion online but don't know how to adapt the code below so that it applies to each row. (Apologies: I can't provide the link, because including it triggers an error saying my code is not formatted correctly and blocks the post. This is also why some of the code below is formatted as quotes.)
    def variance(data, ddof=0):
        n = len(data)
        mean = sum(data) / n
        return sum((x - mean) ** 2 for x in data) / (n - ddof)
I have tried various things. For example, I put the calculation in a loop, first attempting just to get a row average:
    for row in importedData:
        mean_test = np.mean(importedData,axis=1)
        print(mean_test)
This gives me an error I can't figure out:
    Traceback (most recent call last):
      File "<string>", line 13, in <module>
    TypeError: list indices must be integers or slices, not tuple
I also tried this and get no output because I seem to be stuck in a loop:
    n = len(importedData[0,:])  # Trying to get the length of each row.
    mean = mean(importedData[0,:])  # Likewise trying to get the mean of each row.
    deviations = [(x - mean) ** 2 for x in importedData]
    variance = sum(deviations) / n
If anyone could please point me in the right direction, I would be grateful.
Well, you could do something like this to make things more explicit:
    import numpy as np

    importedData = np.arange(1, 10).reshape(3, 3)

    # Get means for each row
    means = [row.mean() for row in importedData]

    # Calculate squared errors
    squared_errors = [(row - mean) ** 2 for row, mean in zip(importedData, means)]

    # Calculate "mean for each row of squared errors" (aka the variance)
    variances = [row.mean() for row in squared_errors]

    # Sanity check
    print(variances)
    print(importedData.var(1))
    # [0.6666666666666666, 0.6666666666666666, 0.6666666666666666]
    # [0.66666667 0.66666667 0.66666667]
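If the aim is to avoid NumPy's own variance and mean routines in the check, here is a minimal sketch of two alternatives. The first applies the `variance` function from the question to each row of the nested list. The second evaluates the mean-of-squares minus square-of-mean formula from the question along each row. The names `data`, `var_loop`, and `var_formula` are mine, chosen for illustration. As a side note, the `TypeError` in the question comes from indexing a plain nested list with `[0, :]`; that kind of indexing only works once the list has been converted to a NumPy array, as done here with `np.asarray`.

```python
import numpy as np

# Nested list as in the question.
importedData = [[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]]

# Converting to an array is what makes indexing such as data[0, :] valid.
data = np.asarray(importedData, dtype=float)


def variance(row, ddof=0):
    """Variance of a single row using only built-in Python arithmetic."""
    n = len(row)
    mean = sum(row) / n
    return sum((x - mean) ** 2 for x in row) / (n - ddof)


# Option 1: call the function once per row (one row per timestep).
var_loop = [variance(row) for row in importedData]

# Option 2: Var(S) = <S*S> - <S><S>, with the averages taken along each row.
var_formula = (data * data).mean(axis=1) - data.mean(axis=1) ** 2

# Compare both against np.var; ddof=0 matches np.var's default.
print(var_loop)
print(var_formula)
print(np.allclose(var_loop, np.var(data, axis=1)))
print(np.allclose(var_formula, np.var(data, axis=1)))
```

Both options should agree with `np.var(importedData, axis=1)` up to floating-point rounding. The second form subtracts two numbers that can be nearly equal, so for data with a large mean and a small spread it may lose precision. The explicit squared-deviation version is usually the safer cross-check.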