python-3.x machine-learning scikit-learn logistic-regression categorical-data

Logistic regression coefficients not making sense

I am trying to build a logistic regression model for telecom churn.

Some background: to predict churn, our dataset has variables such as Account_Age, Current_Bill_Amount, Avg_Days_Delinquent (days the bill has been unpaid), complaints, Avg_Calls, and some more.

My question relates to the complaint variable. There are 6 complaint categories, as shown in the image, so the variable has been transformed into 5 dummy variables, leaving 'pricing' complaints out.

As the image shows, 'Call Quality' and 'Billing Problem' complaints have high absolute and percentage churn, while the other complaint types do not contribute as much to churn.

The images may not be showing at your end, so here are the links: https://i.sstatic.net/VTD12.jpg https://i.sstatic.net/FXLhT.jpg

I have 2 problems regarding the influence of complaints on churn.

Problem 1:

The model does not consider 'Call Quality' a significant variable: it has a p-value of 0.527. Given that 81% of customers with a 'Call Quality' complaint churn (refer to the image), this result seems contradictory. I can't understand why this is happening, since call quality definitely influences churn. Please share your thoughts on this.

Problem 2:

The coefficients for the significant variables (p < 0.05) 'Billing Problem', 'Check Account' and 'Moving' are -1.0033, -2.5675 and -2.1132 respectively. Common sense says that a complaint should increase churn, so the coefficient should be positive. Why, then, does the model calculate negative coefficients for these 3 dummy variables?

Let me know if you need any more info or clarification.

import statsmodels.api as sm

# Note: sm.Logit does not add an intercept; X_train is used exactly as given.
logReg = sm.Logit(Y_train, X_train)
logistic_regression = logReg.fit()
print(logistic_regression.summary())

Expected result: answers to problems 1 and 2.

Solution

Problem 1:

The p-value is not the probability that the coefficient is non-zero, nor a direct measure of how important your feature is, even though it is often interpreted that way. It is the probability of seeing an estimate at least this extreme if the true coefficient were zero. All you can really conclude here is that the data do not allow you to say (with good confidence) that the coefficient differs from zero. Check the 95% confidence interval for the coefficient: it will be broad and will include zero, so both negative and positive values remain plausible.

As an example, one possible explanation is that this variable carries information that is redundant with some of the other variables, which would explain why the model cannot establish its usefulness. Try forward or backward selection to iteratively select the relevant variables; it might change your final selection.
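To make the link between the p-value and the confidence interval concrete, here is a minimal sketch of the usual Wald calculation. The coefficient and standard error are hypothetical values chosen only so that the p-value lands near the 0.527 from the question; they are not taken from the actual model output, and `wald_summary` is just an illustrative helper name.

```python
import math

def wald_summary(coef, std_err, z_crit=1.959964):
    """Return (z, two-sided p-value, (ci_low, ci_high)) for a Wald test of coef == 0."""
    z = coef / std_err
    # Two-sided p-value under the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # 95% confidence interval: estimate plus or minus z_crit standard errors.
    ci = (coef - z_crit * std_err, coef + z_crit * std_err)
    return z, p_value, ci

# Hypothetical numbers, NOT the question's real estimate or standard error.
z, p_value, (ci_low, ci_high) = wald_summary(coef=0.30, std_err=0.475)
print(f"z = {z:.3f}, p = {p_value:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

A large p-value and an interval that straddles zero are two views of the same thing: the estimate is too noisy to pin down its sign. On a fitted statsmodels result, the same intervals appear in the `[0.025  0.975]` columns of `summary()` and are also available through `conf_int()`.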

Problem 2:

There is absolutely no problem with the coefficients being negative.

Indeed, what you are modeling with the logistic regression is:

P(churn) = 1 / (1 + exp(-sum_i(beta_i * x_i)))  (see Wikipedia, for example)

beta_i being the coefficient for variable x_i

You can see that a negative coefficient lowers the churn probability.
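As a quick numerical check of that statement, the sketch below evaluates the standard logistic form 1 / (1 + exp(-z)) with and without a complaint dummy switched on. The coefficients are hypothetical and chosen only for illustration; `churn_probability` is an illustrative helper, not part of statsmodels.

```python
import math

def churn_probability(betas, xs):
    """Logistic model: P = 1 / (1 + exp(-sum(beta_i * x_i)))."""
    linear = sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients: an intercept (x = 1) and one complaint dummy.
betas = [0.0, -2.0]

p_without = churn_probability(betas, [1, 0])  # dummy off
p_with = churn_probability(betas, [1, 1])     # dummy on

print(f"dummy off: {p_without:.3f}, dummy on: {p_with:.3f}")
```

Turning on a dummy with a negative coefficient moves the linear term down, and the predicted probability drops with it.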

Here you use the set of users with a reported complaint (I cannot see a 'no complaint' category), for which, according to the picture you linked, the churn probability is 48.5%.

So the 'default' churn probability is 48.5%, whereas the churn probability for the 'Moving' dummy variable is only 13.7%. Adding the information that the user has a 'Moving' complaint therefore lowers the churn probability. Hence the negative coefficient, and the same holds for 'Billing Problem' and 'Check Account'.

Now, if you included the whole set of users, it could be that any type of complaint would increase the churn probability, and you would get positive coefficients.
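The sign can be checked by hand from the two rates quoted above. In the simplest case, assumed here purely for illustration, of a model with an intercept and a single dummy, the dummy's coefficient equals the difference in log-odds between the dummy group and the baseline. The question's model contains more variables, so its fitted value (-2.1132) will not match this number, but the direction is the same.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

baseline_rate = 0.485  # overall churn rate among complainers, from the linked picture
moving_rate = 0.137    # churn rate for 'Moving' complaints, from the linked picture

# Coefficient of a single dummy in an intercept-plus-dummy logistic model
# (an assumption for this sketch, not the question's actual specification).
moving_coef = logit(moving_rate) - logit(baseline_rate)

print(f"log-odds difference for 'Moving': {moving_coef:.3f}")
```

Because 13.7% is below the 48.5% baseline, the log-odds difference is negative, which is exactly what a negative dummy coefficient expresses.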