I have data in a pandas DataFrame like:
df =
   X1  X2  X3       Y
0   1   2  10   5.077
1   2   2   9  32.330
2   3   3   5  65.140
3   4   4   4  47.270
4   5   2   9  80.570
and I want to do a multiple regression analysis. Here Y is the dependent variable and X1, X2 and X3 are the independent variables. The correlation of each independent variable with the dependent variable is:
df.corr():
          X1        X2        X3         Y
X1  1.000000  0.353553 -0.409644  0.896626
X2  0.353553  1.000000 -0.951747  0.204882
X3 -0.409644 -0.951747  1.000000 -0.389641
Y   0.896626  0.204882 -0.389641  1.000000
As we can see, Y has the highest correlation with X1, so I have selected X1 as the first independent variable. Following this process, I now want to select as the second independent variable the one with the highest partial correlation with Y. How can I find the partial correlation in this case?
Pairwise ranks between Y (last col) and others
If you are only trying to rank the other columns by their correlation with Y, simply do -
corrs = df.corr().values
# Negate the correlation values (not the argsort result) to sort in descending order
ranks = df.columns[:-1][(-corrs[:-1,-1]).argsort()].tolist()
Sample run -
In [145]: df
Out[145]:
         X1        X2        X3         Y
0  0.576562  0.481220  0.148405  0.929005
1  0.732278  0.934351  0.115578  0.379051
2  0.078430  0.575374  0.945908  0.999495
3  0.391323  0.429919  0.265165  0.837510
4  0.525265  0.331486  0.951865  0.998278
In [146]: df.corr()
Out[146]:
          X1        X2        X3         Y
X1  1.000000  0.354387 -0.642953 -0.646551
X2  0.354387  1.000000 -0.461510 -0.885174
X3 -0.642953 -0.461510  1.000000  0.649758
Y  -0.646551 -0.885174  0.649758  1.000000
In [147]: corrs = df.corr().values
In [148]: df.columns[:-1][(-corrs[:-1,-1]).argsort()].tolist()
Out[148]: ['X3', 'X1', 'X2']
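For reference, the same ranking can be sketched with pandas label-based operations instead of positional indexing. This is a minimal sketch that assumes the data from the question and a target column named `Y`; the variable names `corr_with_y`, `signed_rank` and `abs_rank` are only illustrative. Ranking by absolute value is an added assumption, on the reasoning that a strongly negative correlation may matter as much as a positive one when choosing predictors.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "X1": [1, 2, 3, 4, 5],
    "X2": [2, 2, 3, 4, 2],
    "X3": [10, 9, 5, 4, 9],
    "Y": [5.077, 32.330, 65.140, 47.270, 80.570],
})

# Correlation of every predictor with Y (Y's correlation with itself dropped).
corr_with_y = df.corr()["Y"].drop("Y")

# Signed ranking: largest correlation first.
signed_rank = corr_with_y.sort_values(ascending=False).index.tolist()

# Ranking by magnitude (assumption: strength matters, not direction).
abs_rank = corr_with_y.abs().sort_values(ascending=False).index.tolist()

print(signed_rank)
print(abs_rank)
```

Selecting the column by name avoids relying on Y being the last column.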
Pairwise ranks between all columns
If you are trying to rank all pairs of columns by their correlation with each other, one approach would be -
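import numpy as np
import pandas as pd

def pairwise_corr_rank(df):
    corrs = df.corr().values
    cols = df.columns
    n = corrs.shape[0]
    # Upper-triangle indices (offset 1) pick each column pair once, skipping the diagonal
    r,c = np.triu_indices(n,1)
    # Ascending sort order of the pair correlations
    idx = corrs[r,c].argsort()
    # Stack pair names and values column-wise, then reverse to get descending order
    out = np.c_[cols[r[idx]], cols[c[idx]], corrs[r,c][idx]][::-1]
    return pd.DataFrame(out, columns=['P1','P2','Value'])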
Sample run -
In [109]: df
Out[109]:
   X1  X2  X3       Y
0   1   2  10   5.077
1   2   2   9  32.330
2   3   3   5  65.140
3   4   4   4  47.270
4   5   2   9  80.570
In [110]: df.corr()
Out[110]:
          X1        X2        X3         Y
X1  1.000000  0.353553 -0.409644  0.896626
X2  0.353553  1.000000 -0.951747  0.204882
X3 -0.409644 -0.951747  1.000000 -0.389641
Y   0.896626  0.204882 -0.389641  1.000000
In [114]: pairwise_corr_rank(df)
Out[114]:
   P1  P2     Value
0  X1   Y  0.896626
1  X1  X2  0.353553
2  X2   Y  0.204882
3  X3   Y -0.389641
4  X1  X3 -0.409644
5  X2  X3 -0.951747
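Note that both rankings use plain pairwise correlations, so neither controls for X1, which is what the partial correlation step in the question needs. A minimal sketch of that step follows, assuming the standard first-order partial correlation formula and the data from the question. `partial_corr` is a hypothetical helper written for this example, not a pandas function, and choosing the candidate by absolute value is an assumption.

```python
import numpy as np
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "X1": [1, 2, 3, 4, 5],
    "X2": [2, 2, 3, 4, 2],
    "X3": [10, 9, 5, 4, 9],
    "Y": [5.077, 32.330, 65.140, 47.270, 80.570],
})

def partial_corr(df, x, y, control):
    """First-order partial correlation of x and y, controlling for one column.

    Hypothetical helper (not part of pandas). It applies
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2)).
    """
    r = df[[x, y, control]].corr()
    r_xy = r.loc[x, y]
    r_xz = r.loc[x, control]
    r_yz = r.loc[y, control]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# X1 is already selected, so score the remaining candidates against Y given X1.
candidates = ["X2", "X3"]
scores = {c: partial_corr(df, c, "Y", "X1") for c in candidates}

# Assumption: the candidate with the largest absolute partial correlation wins.
best = max(scores, key=lambda c: abs(scores[c]))

print(scores)
print(best)
```

This sketch handles a single control variable only. Once two or more variables have been selected, a higher-order form would be needed, for example correlating the residuals left after regressing both Y and the candidate on all selected variables.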