StatsModel Linear Regression: Initial vs Reduced Model - Is it better?


I am running a linear regression on a dataset (granted, it is for school purposes and I was told it's fictitious information). I first chose my variables from the larger dataset and encoded them accordingly.

I ran the initial regression and got the results shown in the screenshot.
[Screenshot: Initial Regression Model Summary]

I then ran RFE, set it to select 3 features, and reran the regression, which gave the following results.
[Screenshot: Reduced Regression model]

The code used for the X/Y split in the initial model:

# Creating feature variables, where X = independent variables and Y=dependent variables
X_data = df2.drop('TotalCharge', axis=1)
Y_data = df2[['TotalCharge']]

print('The shape of the features is:',X_data.shape)
X_data.head()
print('The shape of the labels:',Y_data.shape)
Y_data.head()

The code used for the reduced model:

X_data2 = df2[['Age', 'Income', 'VitD_levels', 'Doc_visits', 'Gender_Male', 'Services_Intravenous', 'Overweight_Yes', 'Diabetes_Yes']]
Y_data2 = df2[['TotalCharge']]

print('The shape of the features is:',X_data2.shape)
X_data2.head()
print('The shape of the labels:',Y_data2.shape)
Y_data2.head()

My question is: is the reduced model better? I'm not quite sure how to analyze this (still new to this).

I tried choosing new features, checking for multicollinearity, normalizing before running the regression, and even using scikit-learn instead of statsmodels. I'm not sure how to analyze the results to see whether the model is better.


Solution

  • A couple of observations:

    1. You had p-values of 0 for Complication_risk, Initial_admin_Emergency Admission, and Arthritis_Yes. This indicates that these variables are statistically significant at the 5% level, yet they were removed from the reduced model, which reduces its predictive power.

    2. In any event, the R-Squared statistics for both models are quite low (0.021 and 0.001). This indicates that neither model does a good job of explaining the variation in the dependent variable, TotalCharge. An R-Squared of 1 means the model explains 100% of the variation, whereas an R-Squared of 0 means it explains none of it.

    The short answer to your question is that the reduced model is not better than the original - but the original model does not have much predictive power either.
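
    To make the R-Squared comparison concrete, here is a minimal sketch of how R-Squared and adjusted R-Squared are computed for a full model and a reduced one. It uses synthetic stand-in data, not your dataset, and the helper name `fit_r_squared` is made up for illustration. The reduced model deliberately drops the one informative column, which mimics dropping a significant predictor.

```python
import numpy as np

def fit_r_squared(X, y):
    """Fit OLS with an intercept; return (R-squared, adjusted R-squared)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])           # add intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)   # least-squares fit
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)               # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())         # total variation
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)   # penalizes extra predictors
    return r2, adj_r2

# Synthetic stand-in data (an assumption; NOT the dataset from the question)
rng = np.random.default_rng(0)
n = 500
X_full = rng.normal(size=(n, 6))
y = 2.0 * X_full[:, 0] + rng.normal(scale=5.0, size=n)  # only column 0 matters

# Reduced model keeps columns 3..5 and so drops the informative column 0
X_reduced = X_full[:, 3:]

r2_full, adj_full = fit_r_squared(X_full, y)
r2_red, adj_red = fit_r_squared(X_reduced, y)

print(f"full:    R2={r2_full:.3f}  adj R2={adj_full:.3f}")
print(f"reduced: R2={r2_red:.3f}  adj R2={adj_red:.3f}")
```

    In statsmodels the same two numbers are available on a fitted OLS result as `rsquared` and `rsquared_adj`, so you can read them off both of your models directly. Adjusted R-Squared is usually the fairer of the two when the models have different numbers of predictors.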

    A good next step might be to rerun the model with only the significant variables, i.e. Complication_risk, Initial_admin_Emergency Admission, and Arthritis_Yes, and see how the fit holds up. Compare adjusted R-Squared rather than plain R-Squared here, because plain R-Squared can never increase when predictors are removed. If the fit is still poor, that is a good indication that the variation in the dependent variable cannot be adequately explained by the independent variables provided.