Search code examples
pythonpandasstatisticscovariance

Get correlation between a subset of columns in pandas


So, I am familiar with some more advanced Pandas indexing and with using .apply() on columns/rows of data, but for some reason I cannot figure out how to get a variance-covariance matrix for some columns manually without reverting to .corr().

Whenever I try the following to slice the columns I want (all rows, and all columns after "APPL R"),

a = np.cov(data1.loc[:,'AAPL R':])

I get an answer like

array([[  7.27823733e-04,   2.26078140e-04,   1.99410574e-04, ...,
          9.57801851e-05,  -2.76751908e-04,   5.71509542e-05],
       [  2.26078140e-04,   1.26173616e-04,   2.17306383e-04, ...,
          3.30070375e-05,  -2.47989495e-05,   1.16819595e-05],
       [  1.99410574e-04,   2.17306383e-04,   1.92984210e-03, ...,
         -2.27400788e-04,  -4.46589343e-05,  -5.90865613e-05],
       ..., 
       [  9.57801851e-05,   3.30070375e-05,  -2.27400788e-04, ...,

How could I manually get a simple output like this in Pandas?

enter image description here

Thank you in advance.


Solution

  • If you insist on not using DataFrame.corr():

    • You aren't sending the correct lists to the numpy function. You want each column as its own list, so you should be using the transpose of the DataFrame.

    • You want the correlation matrix, not the covariance matrix, so you should be using np.corrcoef(). (If you really want the covariance matrix, then fine use np.cov())

    Here's a df I had in memory from another question.

                 open   high    low  close  volume
    date                                          
    2017-11-01  44.66  44.75  42.19  42.93    3500
    2017-11-03  44.66  44.75  42.19  42.93    3500
    2017-11-06  43.15  43.75  40.60  41.02    9200
    2017-11-07  43.15  43.75  40.60  41.02    9200
    2017-11-08  43.15  43.75  40.60  41.02    9200
    2017-11-09  43.15  43.75  40.60  41.02    9200
    2017-11-10  43.15  43.75  40.60  41.02    9200
    2017-11-13  41.60  43.21  40.03  42.36    3575
    2017-11-14  41.60  43.21  40.03  42.36    3575
    ...
    
    df.corr()
    #            open      high       low     close    volume
    #open    1.000000  0.891708  0.957078  0.351604  0.320314
    #high    0.891708  1.000000  0.878307  0.610183  0.311939
    #low     0.957078  0.878307  1.000000  0.559366  0.146151
    #close   0.351604  0.610183  0.559366  1.000000 -0.132609
    #volume  0.320314  0.311939  0.146151 -0.132609  1.000000
    
    np.corrcoef(df.loc[:,'open':].values.T)
    #array([[ 1.        ,  0.89170836,  0.95707833,  0.35160354,  0.32031362],
    #       [ 0.89170836,  1.        ,  0.87830748,  0.61018322,  0.31193906],
    #       [ 0.95707833,  0.87830748,  1.        ,  0.55936625,  0.14615072],
    #       [ 0.35160354,  0.61018322,  0.55936625,  1.        , -0.13260909],
    #       [ 0.32031362,  0.31193906,  0.14615072, -0.13260909,  1.        ]])