I am new to the analytics field and have a few doubts. I hope I can get my answers here.
I am in the middle of implementing logistic regression in Python. To handle the categorical variables, I have used get_dummies. Suppose the column name is house type, with the levels Beach, Mountain and Plain. What we do here is create three dummy variables and drop one of them, since Plain can be inferred from the other two.
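For reference, this is roughly the encoding step I mean (the frame and column names are just illustrative):

import pandas as pd

# Toy frame with the categorical column from my example
df = pd.DataFrame({'house_type': ['Beach', 'Mountain', 'Plain', 'Beach']})

# drop_first=True drops the first level ('Beach' here) and keeps two dummies;
# the dropped level is implied when both remaining dummies are 0
dummies = pd.get_dummies(df['house_type'], drop_first=True, dtype=int)
print(dummies)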
But when I run RFE on the data, do I need to include all three dummy variables? (I saw a blog where the dummy was not dropped and got confused.)
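For reference, this is roughly the RFE setup I mean (the data below is just a toy stand-in for my actual frame):

import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy data standing in for my real frame
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'house_type': rng.choice(['Beach', 'Mountain', 'Plain'], size=200),
    'sqft': rng.normal(1500, 300, size=200),
})
y = rng.integers(0, 2, size=200)

# Encode the categorical column, dropping one dummy as described above
X = pd.get_dummies(df, columns=['house_type'], drop_first=True, dtype=float)

# Rank the features and keep the top two
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(list(X.columns[rfe.support_]))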
Also, I need to add an intercept column, as I am using statsmodels (which does not add an intercept on its own). In that case, if there are multiple categorical variables (and we have dropped one dummy for each), there won't be any issue, right?
You should end up seeing multicollinearity, as the third dummy column is always determined by the first two: it is 1 when they sum to 0 and 0 when they sum to 1. This redundant dummy should be removed prior to feature selection such as RFE.
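To see the dependence concretely (the names below mirror the example in the question, not your actual columns):

import pandas as pd

df = pd.DataFrame({'house_type': ['Beach', 'Mountain', 'Plain', 'Beach']})
full = pd.get_dummies(df['house_type'], dtype=int)  # all three levels kept

# Every row sums to exactly 1, so any one dummy is fully determined by
# the other two, e.g. Plain == 1 - (Beach + Mountain)
print((full.sum(axis=1) == 1).all())                                  # True
print((full['Plain'] == 1 - full['Beach'] - full['Mountain']).all())  # True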
If you don't, statsmodels will throw a warning in the summary, and if you check the VIFs of the features after fitting, you'll see unacceptably high scores indicating collinear features.
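As a sketch of that check, using variance_inflation_factor from statsmodels (the data and names here are made up):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic design matrix that keeps all three dummies (the bad case)
rng = np.random.default_rng(0)
house = pd.Series(rng.choice(['Beach', 'Mountain', 'Plain'], size=100))
X = sm.add_constant(pd.get_dummies(house, dtype=float))

# One VIF per column; with a redundant dummy the scores blow up
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))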
In any case, once this is done, it is possible that one of your dummy columns is actually a constant, for example if you had no beach houses in your data set. By default, statsmodels' add_constant does nothing when a constant column already exists. To get around this, pass has_constant='add' to indicate you'd like an intercept even if a constant column is already present.
import statsmodels.api as sm

# Force an intercept column even when X already contains a constant column
X = sm.add_constant(X, has_constant='add')