python pandas scipy correlation

How to get p-value and Pearson's r for a list of columns in Pandas?


I'm trying to make a multiindexed table (a matrix) of correlation coefficients and p-values. I'd prefer to use the scipy.stats tests.

import pandas as pd
from scipy.stats import pearsonr

x = pd.DataFrame(
    list(zip([1, 2, 3, 4, 5, 6], [5, 7, 8, 4, 2, 8], [13, 16, 12, 11, 9, 10])),
    columns=["a", "b", "c"],
)
 

# I've tried something like this
for i in range(len(x.columns)):
    r,p = pearsonr(x[x.columns[i]], x[x.columns[i+1]])
    print(f'{r}, {p}')

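Obviously the for loop won't work: it only pairs each column with its right-hand neighbour, and the last iteration indexes past the final column. What I want to end up with is: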

        a     b     c
a r   1.0  -.09   -.8
  p   .00   .87   .06
b r  -.09     1   .42
  p   .87   .00   .41
c r   -.8   .42     1
  p   .06   .41   .00

I had written code to solve this problem (with help from this community) years ago, but it only worked for an older version of spearmanr.

Any help would be very much appreciated.


Solution

  • Here is one way to do it using scipy pearsonr and Pandas corr methods:

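    import pandas as pd
    from scipy.stats import pearsonr


    def pearsonr_pval(x, y):
        # pearsonr returns (r, p-value); keep only the p-value
        return pearsonr(x, y)[1]


    # x is the DataFrame from the question
    df = (
        pd.concat(
            [
                x.corr(method="pearson").reset_index().assign(value="r"),
                # with a callable method, pandas sets the diagonal to 1.0 itself
                x.corr(method=pearsonr_pval).reset_index().assign(value="p"),
            ]
        )
        # each (index, value) group holds exactly one row, so take it as-is
        .groupby(["index", "value"])
        .agg(lambda col: list(col)[0])
    ).sort_index(ascending=[True, False])  # "r" before "p" for each column

    df.index.names = ["", ""]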
    

    Then:

    print(df)
    # Output
                a         b         c
    
    a r  1.000000 -0.088273 -0.796421
      p  1.000000  0.867934  0.057948
    b r -0.088273  1.000000  0.421184
      p  0.867934  1.000000  0.405583
    c r -0.796421  0.421184  1.000000
      p  0.057948  0.405583  1.000000
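
Note that the diagonal p-values in this output are 1.0 rather than the .00 shown in the question. When `DataFrame.corr` is given a callable, pandas fills the diagonal with 1 regardless of what the callable returns. If the diagonal matters, one alternative is to call `pearsonr` on every pair of columns directly. The following is only a minimal sketch under that assumption: the helper name `corr_with_pvalues` is made up for illustration, and the data is repeated from the question so the block runs on its own.

```python
import pandas as pd
from scipy.stats import pearsonr

# Same data as in the question, repeated so the example is self-contained
x = pd.DataFrame(
    list(zip([1, 2, 3, 4, 5, 6], [5, 7, 8, 4, 2, 8], [13, 16, 12, 11, 9, 10])),
    columns=["a", "b", "c"],
)


def corr_with_pvalues(frame):
    """Hypothetical helper: Pearson r and p-value for every pair of columns.

    Returns a DataFrame whose index is a MultiIndex of (column, "r" / "p").
    """
    cols = list(frame.columns)
    data = []
    for row_name in cols:
        # pearsonr returns (statistic, p-value) for one pair of columns
        results = [pearsonr(frame[row_name], frame[col_name]) for col_name in cols]
        data.append([float(r) for r, _ in results])  # the "r" row
        data.append([float(p) for _, p in results])  # the "p" row
    # from_product yields (a, r), (a, p), (b, r), ... matching the row order above
    index = pd.MultiIndex.from_product([cols, ["r", "p"]])
    return pd.DataFrame(data, index=index, columns=cols)


table = corr_with_pvalues(x)
print(table.round(2))
```

The off-diagonal values should agree with the output above; only the diagonal p-values differ, since here they come from `pearsonr` itself rather than from pandas.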