python, scipy, pearson-correlation

Pearson multiple correlation with Scipy


I am trying to do something quite simple: compute a Pearson correlation matrix of several variables that are given as columns of a DataFrame. I want it to ignore NaNs and also provide the p-values. scipy.stats.pearsonr is insufficient because it works only for two variables and cannot account for NaNs. There should be something better than that...

For example,

    import pandas as pd
    df = pd.DataFrame([[1,2,3],[6,5,4],[1,None,9]])

       0    1  2
    0  1  2.0  3
    1  6  5.0  4
    2  1  NaN  9

The columns of df are the variables and the rows are observations. I would like a command that returns a 3x3 correlation matrix, along with a 3x3 matrix of corresponding p-values. It should omit the missing values (None/NaN) pairwise. That is, the correlation between [1,6,1] and [2,5,NaN] should be the correlation between [1,6] and [2,5].

There must be a nice Pythonic way to do that. Can anyone suggest one?


Solution

  • If you have your data in a pandas DataFrame, you can simply use df.corr().

    From the docs:

    DataFrame.corr(method='pearson', min_periods=1)
    Compute pairwise correlation of columns, excluding NA/null values
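
Note that df.corr() returns only the coefficients, not the p-values asked for in the question. One way to get both is to loop over column pairs, drop the rows where either value is missing, and call scipy.stats.pearsonr on what remains. The following is a minimal sketch under that assumption; the helper name corr_with_pvalues and its min_periods parameter are made up for illustration and are not part of pandas or SciPy.

```python
import numpy as np
import pandas as pd
from scipy import stats


def corr_with_pvalues(df, min_periods=2):
    """Hypothetical helper: pairwise Pearson r and p-values, ignoring NaNs.

    For each pair of columns, only rows where both values are present
    are used (pairwise deletion, as df.corr() does).
    """
    cols = df.columns
    r = pd.DataFrame(np.nan, index=cols, columns=cols)
    p = pd.DataFrame(np.nan, index=cols, columns=cols)
    # pearsonr needs at least two observations.
    needed = max(min_periods, 2)
    for a in cols:
        for b in cols:
            # Rows where both columns have a value.
            mask = df[a].notna() & df[b].notna()
            if mask.sum() < needed:
                continue  # too few observations: leave NaN
            x = df.loc[mask, a].to_numpy(dtype=float)
            y = df.loc[mask, b].to_numpy(dtype=float)
            result = stats.pearsonr(x, y)
            r.loc[a, b] = result[0]  # correlation coefficient
            p.loc[a, b] = result[1]  # two-sided p-value
    return r, p


df = pd.DataFrame([[1, 2, 3], [6, 5, 4], [1, None, 9]])
r_mat, p_mat = corr_with_pvalues(df)
print(r_mat)
print(p_mat)
```

Pairs with only two shared observations (such as columns 0 and 1 here) always give a correlation of exactly +1 or -1, so their p-values carry no information. Raising min_periods leaves such entries as NaN instead.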