Tags: r, logistic-regression

Predictor in logistic regression for a large sample size (1.8 million obs.) predicts only 0's


I am trying to run a logistic regression model to predict the default probabilities of individual loans. I have a large sample of 1.85 million observations, about 81% of which were fully paid off and the rest defaulted. When I ran the logistic regression with 20+ statistically significant predictors, I got the warning "fitted probabilities numerically 0 or 1 occurred". By adding predictors step by step, I found that a single predictor was causing the problem: annual income (annual_inc). I then ran a logistic regression with only this predictor and found that it predicts only 0's (fully paid off loans), although a significant proportion of the loans defaulted.

I tried different proportions of training and testing data. If I split the data so that 80% of the original sample goes to the testing set and 20% to the training set, R doesn't show the fitted-probabilities warning, but the model still predicts only 0's on the testing set. Below I attach the relevant code just in case. I doubt that adding a small sample of my data would be of any use given the circumstances, but if I am mistaken, please let me know and I will add it.

set.seed(42)

# Note: this split deliberately puts 80% of the rows in the test set
# and the remaining 20% in the training set
indexes  <- sample(1:nrow(df), 0.8 * nrow(df))
df_test  <- df[indexes, ]
df_train <- df[-indexes, ]

mymodel_2 <- glm(loan_status ~ annual_inc, data = df_train, family = "binomial")
summary(mymodel_2)

Call:
glm(formula = loan_status ~ annual_inc, family = "binomial", 
    data = df_train)

Deviance Residuals: 
  Min       1Q   Median       3Q      Max  
-0.6902  -0.6530  -0.6340  -0.5900   5.4533  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
  (Intercept) -1.308e+00  8.290e-03 -157.83   <2e-16 ***
  annual_inc  -2.426e-06  9.382e-08  -25.86   <2e-16 ***
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 352917  on 370976  degrees of freedom
Residual deviance: 352151  on 370975  degrees of freedom
AIC: 352155

Number of Fisher Scoring iterations: 4

res <- predict(mymodel_2, df_test, type = "response")
confmatrix <- table(Actual_value    = df_test$loan_status,
                    Predicted_value = res > 0.5)
confmatrix
            Predicted_value
Actual_value   FALSE
           0 1212481
           1  271426
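For context, the all-FALSE column follows directly from the fitted coefficients rather than from any bug: with an intercept of about -1.308 and a negative slope on annual_inc, the largest fitted probability the model can produce (at an income of 0) is plogis(-1.308) ≈ 0.21, well below the 0.5 cutoff, so no observation can ever be classified as a default. A quick check, using the coefficient values from the summary above:

```r
# Maximum possible fitted probability: intercept only (annual_inc = 0)
plogis(-1.308)                            # ~0.213, already below 0.5

# For a typical income the probability is even lower,
# e.g. at annual_inc = 75000 (illustrative value):
plogis(-1.308 + (-2.426e-06) * 75000)     # ~0.18
```

Since no fitted probability can exceed 0.21, the res > 0.5 rule above is guaranteed to return FALSE for every row.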

Moreover, when I searched for a solution to this issue on the Internet, I saw it is often attributed to perfect separation, but in my case the model predicts only 0's, and the analogous cases I have seen had small sample sizes. So far I am hesitant to implement penalised logistic regression, because I don't think my issue is perfect separation. It is also worth pointing out that I want to use logistic regression specifically, due to the specifics of my research. How can I overcome the issue at hand?
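Since defaults make up only ~19% of the sample, one alternative worth noting alongside resampling is to keep the fitted model and simply lower the classification cutoff, for example to the observed default rate. A minimal sketch, assuming the same df_train, df_test, res, and 0/1 coding of loan_status as in the code above:

```r
# Use the training-set base rate of defaults as the cutoff instead of 0.5
cutoff <- mean(df_train$loan_status == 1)   # roughly 0.19 for this data

# Rebuild the confusion matrix with the adjusted cutoff;
# some predicted probabilities can now exceed the threshold
confmatrix_adj <- table(Actual_value    = df_test$loan_status,
                        Predicted_value = res > cutoff)
confmatrix_adj
```

The model's ranking of loans is unchanged; only the decision rule moves, which is often all that is needed when the warning stems from class imbalance rather than separation.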


Solution

  • As @deschen suggested, I used the ROSE resampling technique from the ROSE package for R and it solved my issue, although over-sampling, under-sampling, and a combination of both worked as well.
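For reference, a minimal sketch of how the ROSE-based fix might look; the formula, data frame, and column names are carried over from the question, and ROSE() generates a synthetic training set with roughly balanced classes by default (p = 0.5):

```r
library(ROSE)

# Synthetically balanced training set via ROSE
df_rose <- ROSE(loan_status ~ annual_inc, data = df_train, seed = 42)$data

# Plain over-/under-sampling alternatives, which also worked:
# df_over  <- ovun.sample(loan_status ~ annual_inc, data = df_train, method = "over")$data
# df_under <- ovun.sample(loan_status ~ annual_inc, data = df_train, method = "under")$data
# df_both  <- ovun.sample(loan_status ~ annual_inc, data = df_train, method = "both")$data

# Refit the same logistic regression on the balanced data
mymodel_rose <- glm(loan_status ~ annual_inc, data = df_rose, family = "binomial")
```

With roughly balanced classes, the fitted intercept is no longer pushed so far negative, so predicted probabilities can cross the default 0.5 cutoff.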