Tags: python, pandas, regression, correlation, pearson-correlation

Different p-values between statsmodels linear regression and pandas df.corr()


I am working on a housing price prediction dataset. It has 13 features and I am using a multiple linear regression model. When I check the correlation of the features with the target value, the df.corr() method and the summary() function give inconsistent results.

For a few features the Pearson correlation coefficients are low, but when I call summary() after fitting the regression, those same features have unexpected p-values. The feature with the lowest correlation coefficient does not have the highest p-value; in fact, there is no apparent relationship between the correlation coefficients and the p-values produced by these two functions. What could possibly have gone wrong?

For the correlation coefficients:

correlation_matrix = BostonHousing_df.corr().round(2)

For the p-values:

import statsmodels.api as sm

X = BostonHousing_df.iloc[:, :-1].values  # all 13 feature columns
y = BostonHousing_df.iloc[:, -1].values   # target column
X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]  # keep all 13 features
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
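
(For reference: sm.OLS does not add an intercept automatically. A minimal sketch of the usual pattern with an explicit constant term, assuming BostonHousing_df is already loaded with the target in the last column:)

import statsmodels.api as sm

# add_constant prepends a column of ones so the model includes an intercept term
X_const = sm.add_constant(BostonHousing_df.iloc[:, :-1])
y = BostonHousing_df.iloc[:, -1]
print(sm.OLS(y, X_const).fit().summary())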

Since the DataFrame.corr() method uses Pearson correlation by default, both functions should give similar results, but that is not happening. Below are two images of the results.

[image: regression summary p-values]

[image: correlation coefficients]

If you compare the two results in the images, the features with the lowest correlation coefficients do not have the highest p-values.


Solution

  • The issue here is that a pairwise Pearson correlation does not account for the effect of all the other variables, whereas the p-value of a coefficient in the regression model does. So you can't expect a direct relationship between a feature's Pearson correlation with the target and its p-value in the regression.

    Here is an extreme example to illustrate this:

    Say we have a target c, defined as the sum of two features, a + b, and say you have the following training set:

    a = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]    
    b = [4, 3, 2, 1, 0, 6, 5, 4, 3, 2]  
    c = [5, 5, 5, 5, 5, 7, 7, 7, 7, 7]
    

    Notice that even though a + b gives you c exactly, if you just check the correlation between a and c, you get 0!

    import numpy
    numpy.corrcoef(a, c)
    > array([[1., 0.],
             [0., 1.]])
    

    But if you plug these data into a linear regression estimator, you would of course get an extremely small p-value for a.
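
    A minimal sketch verifying this (the tiny noise term is my addition, only to avoid a degenerate perfect fit in which all standard errors would be zero):

    import numpy as np
    import statsmodels.api as sm

    a = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], dtype=float)
    b = np.array([4, 3, 2, 1, 0, 6, 5, 4, 3, 2], dtype=float)
    rng = np.random.default_rng(0)
    c = a + b + rng.normal(scale=1e-3, size=a.size)  # tiny noise so the fit is not exactly perfect

    # intercept plus both features; the p-value for a comes out essentially zero
    X = sm.add_constant(np.column_stack([a, b]))
    print(sm.OLS(c, X).fit().pvalues)  # p-values for [const, a, b]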

    So, as you see, a small pairwise correlation with the target does not necessarily imply a lack of effect in the regression (i.e., it does not imply a large p-value).
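
    To see this from the regression's point of view, you can compute a partial correlation: remove b's contribution from both a and c, then correlate the residuals (a sketch; residualize is a hypothetical helper, not a library function):

    import numpy as np
    import statsmodels.api as sm

    a = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], dtype=float)
    b = np.array([4, 3, 2, 1, 0, 6, 5, 4, 3, 2], dtype=float)
    c = a + b

    def residualize(v, w):
        # OLS residuals of v regressed on w (with an intercept)
        W = sm.add_constant(w)
        return v - W @ np.linalg.lstsq(W, v, rcond=None)[0]

    # controlling for b, a is perfectly correlated with c
    print(np.corrcoef(residualize(a, b), residualize(c, b))[0, 1])  # 1.0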