I am working on a housing price prediction dataset. It has 13 features, and I am using a multiple linear regression model. When I check the relationship between the features and the target, the df.corr() method and the summary() function give results that look inconsistent with each other.
For a few features, the Pearson correlation coefficient with the target is low. But when I call summary() after fitting the regression, the p-values for these features do not follow the same pattern: the feature with the lowest correlation coefficient does not have the highest p-value. In other words, there seems to be no correspondence between the correlation coefficients and the p-values obtained from these two functions. What could possibly have gone wrong?
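import statsmodels.api as sm
correlation_matrix = BostonHousing_df.corr().round(2)
X = BostonHousing_df.iloc[:, :-1].values
y = BostonHousing_df.iloc[:, -1].values
# Assumption: X1 (not defined in the original snippet) is X with an intercept column prepended, giving 14 columns
X1 = sm.add_constant(X)
X_opt = X1[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()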
Since the dataframe.corr() method uses Pearson correlation by default, I expected both functions to give similar results, but that is not happening. Below are the two images of the results.
If you compare the two results in the images, the features with the lowest correlation coefficients do not have high p-values.
The issue here is that when you check pairwise Pearson correlations, you are not accounting for the effect of all the other variables. So you can't expect a direct relationship between a feature's Pearson correlation with the target and its p-value within the regression model.
Here is an extreme example to illustrate this:
Say we have a target c, which is defined as the sum of two features, a + b. Say you have the following training set:
a = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
b = [4, 3, 2, 1, 0, 6, 5, 4, 3, 2]
c = [5, 5, 5, 5, 5, 7, 7, 7, 7, 7]
Notice here that, even though a + b gives you c exactly, if you just check the correlation between a and c, you would get 0!
numpy.corrcoef(a, c)
> array([[1., 0.],
[0., 1.]])
But if you plug these data into a linear regression estimator with both a and b as regressors, you would of course get an extremely small p-value for a.
So as you see, small pairwise correlation to the target does not necessarily imply a lack of effect/small p-value.
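The example above can be checked numerically. The following is a minimal sketch using only NumPy, not the statsmodels call from the question. One assumption is added that is not part of the original example: a small fixed perturbation (the hypothetical `noise` array) on c, so that the fit is not exact and the standard errors are nonzero. It computes the pairwise correlation between a and c, then fits ordinary least squares with an intercept, a and b, and reports the t-statistics that a regression summary would base its p-values on.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], dtype=float)
b = np.array([4, 3, 2, 1, 0, 6, 5, 4, 3, 2], dtype=float)

# Assumption (not in the original example): a small fixed perturbation,
# so the regression is not a perfect fit and standard errors are nonzero.
noise = np.array([0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 0.05, -0.05, 0.0])
c = a + b + noise

# Pairwise Pearson correlation between a and the target c.
r_ac = np.corrcoef(a, c)[0, 1]

# Ordinary least squares with an intercept, a and b as regressors.
X = np.column_stack([np.ones_like(a), a, b])
beta, _, _, _ = np.linalg.lstsq(X, c, rcond=None)

# Classical OLS standard errors and t-statistics.
residuals = c - X @ beta
dof = len(c) - X.shape[1]          # residual degrees of freedom
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta / std_err

print("corr(a, c):  ", round(r_ac, 3))
print("coefficients:", beta.round(3))   # order: intercept, a, b
print("t-statistics:", t_stats.round(1))
```

The correlation between a and c stays close to zero, while the coefficient on a comes out close to 1 with a large t-statistic. With 7 residual degrees of freedom, a |t| above roughly 2.4 already corresponds to p < 0.05, so a is clearly significant once b is in the model.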