Search code examples
pythonpandasstatisticscorrelationpearson-correlation

pandas: Compute correlation among rows of a table


I have a table of values with two category fields and a single count. I need to compute the correlation among rows of one category based on their counts across the other category.

So for instance:

Category_A|Category_B|Count
Alan      |Turkey    |7
Alan      |Ham       |1
Alan      |Spam      |0
...
Bob       |Turkey    |2
Bob       |Ham       |9
Bob       |Spam      |12
...

I need to get Pearson's correlation with p-value in a table like this:

Category_A_1|Category_A_2|Correlation|P Value
Alan        |Bob         |0.7        |0.07

I can't figure out how to do this in Pandas. Please help. Thanks!


Solution

  • corrs = df.pivot('Category_A','Category_B').T.corr().stack()
    #Category_A  Category_A
    #Alan        Alan          1.000000
    #            Bob          -0.986552
    #Bob         Alan         -0.986552
    #            Bob           1.000000
    corrs.index.names = 'A','B'
    corrs.reset_index()
    #      A     B         0
    #0  Alan  Alan  1.000000
    #1  Alan   Bob -0.986552
    #2   Bob  Alan -0.986552
    #3   Bob   Bob  1.000000
    

    Unfortunately, Pandas does not have tools for calculating the p-values.