Tags: r, logistic-regression, ensemble-learning

Logistic Regression - Multiple Models to Address Variable Interaction


I am attempting to develop a model to predict the chance of structure fire leading to a fatality using logistic regression. This is about a 1/100 event.

The problem I am facing is that interactions among the variables don't appear to be taken into consideration when the model makes predictions.

For example, taking the entire dataset into consideration, fires are more deadly in the winter months. However, fires where appliances were the root cause show no such trend. A comparison of cooking vs. appliance fires is below (I believe these are fatalities per 1000 fires). The x axis is months 1 to 12.

https://i.imgur.com/nTnnfg6.png?1

When attempting to predict the probability of a fatality where appliances were the cause, I get decreasing values in the summer and increasing values in the winter, even though that trend clearly does not hold for appliance fires, as shown above.

My questions are:

  1. Would creating conditional models be a good solution for this? i.e. subset the data for each cause and create a separate model for each subset. My concern is that this is probably overly complex and I'm sure it's violating some rule of something somewhere.
  2. Is there a better solution than creating these conditional models?
  3. Would creating an ensemble model (simple mean) between this logistic regression model and random forest model be a valid solution? My random forest models were flawed in that they predicted too many cases where values were 100% or 0%.
  4. Can the formula be rewritten in such a way that these variable interactions are taken into account? i.e. Fatality ~ month * Cause
  5. Bonus: Any other advice on solving this issue.

My training data is as follows:

> str(train_val)
'data.frame':   154178 obs. of  13 variables:
 $ month     : Factor w/ 12 levels "1","2","3","4",..: 4 7 7 8 8 11 7 10 6 3 ...
 $ weekday   : Factor w/ 7 levels "Friday","Monday",..: 3 7 2 5 4 3 6 1 5 3 ...
 $ RT        : num  420 480 300 360 600 420 120 240 420 120 ...
 $ CAUSE_CODE: Factor w/ 16 levels "1","2","3","4",..: 6 5 1 7 13 15 16 13 9 15 ...
 $ FIRST_IGN : Factor w/ 11 levels "00","10","12",..: 11 3 10 10 8 10 11 5 5 3 ...
 $ AREA_ORIG : Factor w/ 11 levels "14","21","24",..: 10 10 10 4 3 1 5 10 6 6 ...
 $ HEAT_SOURC: Factor w/ 11 levels "00","10","11",..: 11 2 11 2 2 11 11 11 10 11 ...
 $ INC_TYPE  : Factor w/ 7 levels "110","111","112",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ HUM_FAC_1 : Factor w/ 9 levels "0","1","2","3",..: 9 9 3 9 9 3 9 9 2 9 ...
 $ ALARMS_YN : Factor w/ 3 levels "N","O","Y": 3 3 3 3 3 3 1 1 3 3 ...
 $ losscat   : Factor w/ 4 levels "Minor_Loss","Med_Loss",..: 1 3 1 2 1 1 4 2 2 1 ...
 $ daycat    : Factor w/ 5 levels "Aft-Noon","Evening",..: 1 5 1 5 2 4 2 4 5 5 ...
 $ Fatality  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

Model formula and results are as follows:

> summary(log.mod)

Call:
glm(formula = Fatality ~ ., family = binomial(link = logit), 
    data = train_val)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6507  -0.1691  -0.0886  -0.0487   4.0763  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -3.060e+01  4.532e+03  -0.007 0.994612    
month2            -4.069e-02  9.474e-02  -0.430 0.667545    
month3            -1.077e-01  9.638e-02  -1.117 0.263997    
month4            -3.045e-01  1.056e-01  -2.883 0.003945 ** 
month5            -4.459e-01  1.126e-01  -3.962 7.45e-05 ***
month6            -5.637e-01  1.191e-01  -4.734 2.20e-06 ***
month7            -5.853e-01  1.173e-01  -4.989 6.06e-07 ***
month8            -4.610e-01  1.160e-01  -3.976 7.02e-05 ***
month9            -5.055e-01  1.195e-01  -4.230 2.33e-05 ***
month10           -2.619e-01  1.073e-01  -2.440 0.014676 *  
month11           -1.167e-01  9.830e-02  -1.187 0.235065    
month12           -2.634e-01  1.021e-01  -2.579 0.009902 ** 
weekdayMonday     -1.440e-01  9.117e-02  -1.580 0.114177    
weekdaySaturday   -4.038e-04  8.616e-02  -0.005 0.996261    
weekdaySunday     -5.934e-02  8.778e-02  -0.676 0.499077    
weekdayThursday    1.360e-02  8.754e-02   0.155 0.876560    
weekdayTuesday    -6.722e-02  8.948e-02  -0.751 0.452512    
weekdayWednesday  -3.070e-02  8.843e-02  -0.347 0.728421    
RT                 1.994e-05  2.222e-05   0.898 0.369388    
CAUSE_CODE2       -4.331e-01  3.336e-01  -1.298 0.194277    
CAUSE_CODE3        7.813e-01  2.773e-01   2.817 0.004844 ** 
CAUSE_CODE4       -8.593e-02  1.808e-01  -0.475 0.634692    
CAUSE_CODE5        5.543e-02  1.927e-01   0.288 0.773622    
CAUSE_CODE6        5.294e-02  1.777e-01   0.298 0.765724    
CAUSE_CODE7       -3.656e-01  2.201e-01  -1.661 0.096714 .  
CAUSE_CODE8       -3.122e-01  1.874e-01  -1.666 0.095691 .  
CAUSE_CODE9        9.558e-02  2.044e-01   0.468 0.639972    
CAUSE_CODE10       1.818e-01  2.634e-01   0.690 0.490167    
CAUSE_CODE11      -1.198e+00  3.951e-01  -3.031 0.002436 ** 
CAUSE_CODE12      -1.632e+00  4.607e-01  -3.542 0.000397 ***
CAUSE_CODE13       2.235e-01  1.162e-01   1.923 0.054482 .  
CAUSE_CODE14      -4.895e-01  2.653e-01  -1.845 0.064979 .  
CAUSE_CODE15      -2.877e-01  1.362e-01  -2.113 0.034595 *  
CAUSE_CODE16       7.487e-01  1.373e-01   5.451 5.01e-08 ***
FIRST_IGN10       -6.033e-01  3.100e-01  -1.946 0.051673 .  
FIRST_IGN12       -1.639e+00  4.875e-01  -3.362 0.000774 ***
FIRST_IGN15       -6.184e-01  2.788e-01  -2.218 0.026529 *  
FIRST_IGN17       -5.808e-01  2.431e-01  -2.389 0.016911 *  
FIRST_IGN18       -1.280e+01  1.068e+02  -0.120 0.904587    
FIRST_IGN21        7.630e-01  2.049e-01   3.724 0.000196 ***
FIRST_IGN76       -5.524e-01  2.513e-01  -2.198 0.027916 *  
FIRST_IGN81       -2.210e-01  2.618e-01  -0.844 0.398660    
FIRST_IGNOther     7.508e-02  1.881e-01   0.399 0.689780    
FIRST_IGNUU        2.367e-01  1.887e-01   1.254 0.209663    
AREA_ORIG21       -5.657e-01  8.059e-02  -7.019 2.24e-12 ***
AREA_ORIG24       -7.024e-01  9.924e-02  -7.078 1.46e-12 ***
AREA_ORIG26       -1.923e+00  2.536e-01  -7.584 3.36e-14 ***
AREA_ORIG47       -2.114e+00  1.996e-01 -10.593  < 2e-16 ***
AREA_ORIG72       -1.795e+00  2.292e-01  -7.831 4.83e-15 ***
AREA_ORIG74       -2.271e+00  2.604e-01  -8.722  < 2e-16 ***
AREA_ORIG75       -1.454e+00  2.562e-01  -5.674 1.39e-08 ***
AREA_ORIG76       -2.450e+00  4.177e-01  -5.866 4.46e-09 ***
AREA_ORIGOther    -9.926e-01  7.631e-02 -13.008  < 2e-16 ***
AREA_ORIGUU       -1.067e+00  8.522e-02 -12.526  < 2e-16 ***
HEAT_SOURC10      -4.244e-01  1.972e-01  -2.152 0.031368 *  
HEAT_SOURC11      -3.284e-01  2.533e-01  -1.296 0.194851    
HEAT_SOURC12      -1.106e-01  1.834e-01  -0.603 0.546424    
HEAT_SOURC13      -2.146e-01  2.053e-01  -1.045 0.295942    
HEAT_SOURC40      -5.954e-01  2.675e-01  -2.226 0.026036 *  
HEAT_SOURC43      -3.533e-01  2.753e-01  -1.283 0.199414    
HEAT_SOURC60       4.204e-02  2.375e-01   0.177 0.859472    
HEAT_SOURC61      -2.616e-02  3.182e-01  -0.082 0.934494    
HEAT_SOURCOther   -2.552e-01  1.827e-01  -1.397 0.162513    
HEAT_SOURCUU      -4.886e-02  1.550e-01  -0.315 0.752669    
INC_TYPE111        1.325e+01  1.007e+03   0.013 0.989507    
INC_TYPE112        1.268e+01  1.007e+03   0.013 0.989956    
INC_TYPE120        1.333e+01  1.007e+03   0.013 0.989436    
INC_TYPE121        1.305e+01  1.007e+03   0.013 0.989662    
INC_TYPE122        1.331e+01  1.007e+03   0.013 0.989459    
INC_TYPE123       -9.385e-01  1.375e+03  -0.001 0.999456    
HUM_FAC_11         1.343e+01  4.419e+03   0.003 0.997575    
HUM_FAC_12         1.338e+01  4.419e+03   0.003 0.997585    
HUM_FAC_13         1.181e+01  4.419e+03   0.003 0.997867    
HUM_FAC_14         1.365e+01  4.419e+03   0.003 0.997536    
HUM_FAC_15         1.528e+01  4.419e+03   0.003 0.997241    
HUM_FAC_16         1.271e+01  4.419e+03   0.003 0.997706    
HUM_FAC_17         1.292e+01  4.419e+03   0.003 0.997667    
HUM_FAC_1N         1.224e+01  4.419e+03   0.003 0.997790    
ALARMS_YNO        -1.552e-01  7.111e-02  -2.182 0.029104 *  
ALARMS_YNY         3.230e-03  6.400e-02   0.050 0.959746    
losscatMed_Loss    1.281e+00  1.012e-01  12.660  < 2e-16 ***
losscatMajor_Loss  1.910e+00  1.032e-01  18.500  < 2e-16 ***
losscatTotal_Loss  2.197e+00  1.003e-01  21.904  < 2e-16 ***
daycatEvening      2.340e-01  9.753e-02   2.400 0.016406 *  
daycatMid-Day      3.360e-01  1.104e-01   3.044 0.002334 ** 
daycatMorning      7.029e-01  8.020e-02   8.764  < 2e-16 ***
daycatNight        6.102e-01  7.431e-02   8.211  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20315  on 154177  degrees of freedom
Residual deviance: 16919  on 154091  degrees of freedom
AIC: 17093

Number of Fisher Scoring iterations: 17

Solution

  • I would be very careful with how you use logistic regression. Throwing "the kitchen sink" into a model will usually give some abnormal results. I would start by thinking about which variables are important, and work only with those that provide relevant information. Fitting a regression model is not about throwing every variable in and seeing what sticks; it is about thinking through which variables matter and then using a step-wise method to find the important covariates. This, in itself, may solve the problem you mentioned with coefficient direction.

    For factor variables, you can always recode them so that you are only working with the significant levels. For example, instead of month1-month12, you could have month4-month9 plus "other". Having a separate coefficient for each month is not necessary if not all months are significant.

    In terms of interactions, yes, of course you can specify an interaction with month:cause (in an R formula, month * cause expands to month + cause + month:cause). Use interactions with caution: you should only add one if it makes sense to.

    I would not recommend using conditional models, because this will reduce your degrees of freedom significantly. Adding interactions can achieve the same effect as conditional models, but in a single model.

    I would really only use an ensemble model if you know your individual models are valid. Averaging two poor models will not provide better results.

    I hope this helps!
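
As a rough illustration of the recoding and interaction advice above, here is a minimal sketch on simulated data (not the asker's `train_val`). The variable names `season`, `cause`, and `fatality`, the two cause labels, and the effect sizes are all assumptions made up for the example. It collapses months into two levels, compares a main-effects model with an interaction model using a likelihood-ratio test, and then checks how a per-cause "conditional" model relates to the interaction model.

```r
# Simulated toy data (an assumption for illustration, NOT the real fire data)
set.seed(42)
n <- 20000
month <- factor(sample(1:12, n, replace = TRUE), levels = 1:12)
cause <- factor(sample(c("appliance", "cooking"), n, replace = TRUE))

# Recode the 12 month levels into two: winter (Dec-Feb) vs. everything else
season <- factor(ifelse(month %in% c("12", "1", "2"), "winter", "other"),
                 levels = c("other", "winter"))

# Assumed truth: winter raises the fatality risk for cooking fires only
eta <- -3 + 1.0 * (season == "winter" & cause == "cooking")
d <- data.frame(fatality = rbinom(n, 1, plogis(eta)), season, cause)

# Main-effects model: a single seasonal effect is shared by every cause
fit_main <- glm(fatality ~ season + cause, family = binomial, data = d)

# Interaction model: season * cause = season + cause + season:cause
fit_int <- glm(fatality ~ season * cause, family = binomial, data = d)

# Likelihood-ratio test: does the interaction term improve the fit?
lrt <- anova(fit_main, fit_int, test = "Chisq")
print(lrt)

# Predicted probabilities for each season/cause combination
grid <- expand.grid(season = levels(d$season), cause = levels(d$cause))
grid$p_main <- predict(fit_main, newdata = grid, type = "response")
grid$p_int  <- predict(fit_int,  newdata = grid, type = "response")
print(grid)

# A separate "conditional" model fitted on appliance fires only
fit_app <- glm(fatality ~ season, family = binomial,
               data = subset(d, cause == "appliance"))
new_app <- data.frame(season = factor(levels(d$season),
                                      levels = levels(d$season)))
p_app <- predict(fit_app, newdata = new_app, type = "response")
print(p_app)
```

In this toy setup the main-effects model is forced to give appliance fires the same winter increase as cooking fires, which resembles the behaviour described in the question. The appliance-only model and the interaction model should agree on the appliance predictions up to numerical tolerance, but only because these are the only two predictors and they are fully interacted. With additional covariates whose effects are shared across causes, a single interaction model and separate per-cause models would generally differ.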