Search code examples
pythonpandascorrelationpearson-correlation

correlation matrix with group-by and sort


I am trying calculate correlation matrix with groupby and sort. I have 100 companies from 11 industries. I would like to group by industry and sort by their total assets (atq), and then calculate the correlation of data.pr_multi with this order. however, when I do sort and groupby, it reverses back and calculates by alphabetical order.

The code I use:

index datafqtr tic pr_multi atq industry
0 2018Q1 A NaN 8698.0 4
1 2018Q2 A -0.0856845728151735 8784.0 4
2 2018Q3 A 0.0035103320774146 8349.0 4
3 2018Q4 A -0.0157732687260246 8541.0 4
4 2018Q1 AAL NaN 53280.0 5
5 2018Q2 AAL -0.2694380292532717 52622.0 5

the code I use:

data1=data18.sort_values(['atq'],ascending=False).groupby('industry').head()
df = data1.pivot_table('pr_multi', ['datafqtr'], 'tic')
# calculate correlation matrix using inbuilt pandas function
correlation_matrix = df.corr()
correlation_matrix.head()

Solution

  • IIUC, you want to calculate the correlation between the order based on the groupby and the pr_multi column. use:

    data1=data18.groupby('industry')['atq'].apply(lambda x: x.sort_values(ascending=False))
    np.corrcoef(data1.reset_index()['level_1'], data18['pr_multi'].astype(float).fillna(0))
    

    Output:

    array([[ 1.        , -0.44754795],
           [-0.44754795,  1.        ]])