This is my code for calculating the correlation between two variables using Pearson's r:
def correlation(x, y):
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)
    return (std_x * std_y).mean()
I understand that in order to do so, one needs to:
Which brings me to my question: why is the line below used
std_x = (x - x.mean()) / x.std(ddof=0)
instead of simply:
r = (x.std(ddof=0)*y.std(ddof=0))/len(x)
I think you are confusing the formula for Pearson's coefficient with something else. Say you have two random variables X and Y. Then Pearson's coefficient is defined as
r = Cov(X, Y)/(s_X*s_Y)
where Cov(X, Y) is the covariance between X and Y, and s_X and s_Y are their standard deviations. The covariance is defined as
Cov(X, Y) = E[(X-E[X])*(Y - E[Y])]
where E[Z] denotes the expected value of the random variable Z.
OK, now that we have the formula, how do we compute it? Strictly speaking, you can't, because you don't have access to the true standard deviations and expected values. What we usually compute instead is the sample correlation coefficient, which uses the same formula but replaces the true values with estimates obtained from the data.
A natural (minimum-variance unbiased) estimator for the expected values in the formula is simply the sample mean (given by np.mean). Likewise, the standard deviation is estimated by the empirical standard deviation given by np.std, which uses ddof=0 by default, the same setting as in your function.
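As a small illustration of these two estimators, here is a sketch on a made-up sample (the values and variable names are only for the example). It shows that np.std divides by n unless you pass ddof=1:

```python
import numpy as np

# Made-up sample, chosen only for illustration.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean_x = np.mean(x)             # estimate of E[X]
std_pop = np.std(x)             # default ddof=0: divides by n
std_sample = np.std(x, ddof=1)  # ddof=1: divides by n - 1

print(mean_x, std_pop, std_sample)
```

Either choice of ddof works for the correlation as long as the same one is used in the numerator and the denominator, since the factor cancels.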
So putting it all together, the formula would become
r = np.mean((x-np.mean(x))*(y-np.mean(y)))/(np.std(x)*np.std(y))
which is actually the same as
np.mean(X*Y)
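where X = (x-np.mean(x))/np.std(x)
and Y = (y-np.mean(y))/np.std(y)
because dividing by the constants np.std(x) and np.std(y) can be done inside the mean. These X and Y are exactly std_x and std_y in your function, which is why that line is needed.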
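If you want to convince yourself numerically, here is a minimal sketch on randomly generated data (the data, the seed, and the names r_std, r_cov, r_np and r_alt are assumptions made for this example). It compares your function with the covariance form and with np.corrcoef, and also evaluates the alternative from the question:

```python
import numpy as np

def correlation(x, y):
    # Standardize each variable, then average the products.
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)
    return (std_x * std_y).mean()

# Made-up data: y depends partly on x, so the correlation is clearly non-zero.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

r_std = correlation(x, y)
r_cov = np.mean((x - np.mean(x)) * (y - np.mean(y))) / (np.std(x) * np.std(y))
r_np = np.corrcoef(x, y)[0, 1]

# The alternative from the question only uses the two spreads,
# so shuffling y (which destroys the relationship) does not change it.
r_alt = (x.std(ddof=0) * y.std(ddof=0)) / len(x)
r_alt_shuffled = (x.std(ddof=0) * rng.permutation(y).std(ddof=0)) / len(x)

print(r_std, r_cov, r_np)
print(r_alt, r_alt_shuffled)
```

The first three values should agree up to floating-point error. The alternative never looks at how x and y vary together, so it cannot measure correlation.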