I am trying to do something quite simple: compute a Pearson correlation matrix of several variables that are given as columns of a DataFrame. I want it to ignore NaNs and also provide the p-values. scipy.stats.pearsonr is insufficient because it works only for two variables and cannot account for NaNs. There should be something better than that...
For example,
df = pd.DataFrame([[1,2,3],[6,5,4],[1,None,9]])
   0    1  2
0  1  2.0  3
1  6  5.0  4
2  1  NaN  9
The columns of df are the variables and the rows are observations. I would like a command that returns a 3x3 correlation matrix, along with a 3x3 matrix of corresponding p-values. I want it to omit the None values pairwise. That is, the correlation between [1,6,1] and [2,5,NaN] should be the correlation between [1,6] and [2,5].
There must be a nice Pythonic way to do that. Can anyone please suggest one?
If you have your data in a pandas DataFrame, you can simply use df.corr().
From the docs:
DataFrame.corr(method='pearson', min_periods=1)
Compute pairwise correlation of columns, excluding NA/null values
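Note that df.corr() returns only the coefficients, not the p-values. Here is a minimal sketch of one way to get both, assuming SciPy is installed. corr_with_pvalues is a hypothetical helper name (not part of pandas or SciPy): it loops over column pairs, drops the rows where either value is missing, and calls scipy.stats.pearsonr on what remains.

```python
import numpy as np
import pandas as pd
from scipy import stats


def corr_with_pvalues(df):
    """Hypothetical helper: pairwise Pearson r and p-values, skipping NaNs per pair."""
    cols = df.columns
    r = pd.DataFrame(np.nan, index=cols, columns=cols)
    p = pd.DataFrame(np.nan, index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i:]:
            x = df[a].to_numpy(dtype=float)
            y = df[b].to_numpy(dtype=float)
            # Keep only the observations where both variables are present.
            keep = ~(np.isnan(x) | np.isnan(y))
            if keep.sum() < 2:
                continue  # pearsonr needs at least two points; leave NaN
            r_val, p_val = stats.pearsonr(x[keep], y[keep])
            # The matrices are symmetric, so fill both halves at once.
            r.loc[a, b] = r.loc[b, a] = r_val
            p.loc[a, b] = p.loc[b, a] = p_val
    return r, p


df = pd.DataFrame([[1, 2, 3], [6, 5, 4], [1, None, 9]])
r, p = corr_with_pvalues(df)
print(r)
print(p)
```

The coefficient matrix should agree with df.corr(). Keep in mind that p-values computed from only two or three observations, as in this toy example, carry very little information.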