I am attempting to develop a logistic regression model to predict the chance that a structure fire leads to a fatality. This is roughly a 1-in-100 event.
The problem I am facing is that interactions among the variables don't appear to be taken into consideration when making predictions.
For example, taking the entire dataset into consideration, fires are more deadly in the winter months. However, fires where appliances were the root cause show no such trend. A comparison of cooking vs. appliance fires is below (I believe these are fatalities per 1000 fires). The x axis is months 1 to 12.
When I predict the probability of a fatality for fires caused by appliances, I get decreasing values in the summer and increasing values in the winter, even though the plot above shows no such seasonal trend for appliance fires.
My questions are:
Fatality ~ month * Cause
My training data is as follows:
> str(train_val)
'data.frame': 154178 obs. of 13 variables:
$ month : Factor w/ 12 levels "1","2","3","4",..: 4 7 7 8 8 11 7 10 6 3 ...
$ weekday : Factor w/ 7 levels "Friday","Monday",..: 3 7 2 5 4 3 6 1 5 3 ...
$ RT : num 420 480 300 360 600 420 120 240 420 120 ...
$ CAUSE_CODE: Factor w/ 16 levels "1","2","3","4",..: 6 5 1 7 13 15 16 13 9 15 ...
$ FIRST_IGN : Factor w/ 11 levels "00","10","12",..: 11 3 10 10 8 10 11 5 5 3 ...
$ AREA_ORIG : Factor w/ 11 levels "14","21","24",..: 10 10 10 4 3 1 5 10 6 6 ...
$ HEAT_SOURC: Factor w/ 11 levels "00","10","11",..: 11 2 11 2 2 11 11 11 10 11 ...
$ INC_TYPE : Factor w/ 7 levels "110","111","112",..: 2 2 2 2 2 2 2 2 2 2 ...
$ HUM_FAC_1 : Factor w/ 9 levels "0","1","2","3",..: 9 9 3 9 9 3 9 9 2 9 ...
$ ALARMS_YN : Factor w/ 3 levels "N","O","Y": 3 3 3 3 3 3 1 1 3 3 ...
$ losscat : Factor w/ 4 levels "Minor_Loss","Med_Loss",..: 1 3 1 2 1 1 4 2 2 1 ...
$ daycat : Factor w/ 5 levels "Aft-Noon","Evening",..: 1 5 1 5 2 4 2 4 5 5 ...
$ Fatality : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
Model formula and results are as follows:
> summary(log.mod)
Call:
glm(formula = Fatality ~ ., family = binomial(link = logit),
data = train_val)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6507 -0.1691 -0.0886 -0.0487 4.0763
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.060e+01 4.532e+03 -0.007 0.994612
month2 -4.069e-02 9.474e-02 -0.430 0.667545
month3 -1.077e-01 9.638e-02 -1.117 0.263997
month4 -3.045e-01 1.056e-01 -2.883 0.003945 **
month5 -4.459e-01 1.126e-01 -3.962 7.45e-05 ***
month6 -5.637e-01 1.191e-01 -4.734 2.20e-06 ***
month7 -5.853e-01 1.173e-01 -4.989 6.06e-07 ***
month8 -4.610e-01 1.160e-01 -3.976 7.02e-05 ***
month9 -5.055e-01 1.195e-01 -4.230 2.33e-05 ***
month10 -2.619e-01 1.073e-01 -2.440 0.014676 *
month11 -1.167e-01 9.830e-02 -1.187 0.235065
month12 -2.634e-01 1.021e-01 -2.579 0.009902 **
weekdayMonday -1.440e-01 9.117e-02 -1.580 0.114177
weekdaySaturday -4.038e-04 8.616e-02 -0.005 0.996261
weekdaySunday -5.934e-02 8.778e-02 -0.676 0.499077
weekdayThursday 1.360e-02 8.754e-02 0.155 0.876560
weekdayTuesday -6.722e-02 8.948e-02 -0.751 0.452512
weekdayWednesday -3.070e-02 8.843e-02 -0.347 0.728421
RT 1.994e-05 2.222e-05 0.898 0.369388
CAUSE_CODE2 -4.331e-01 3.336e-01 -1.298 0.194277
CAUSE_CODE3 7.813e-01 2.773e-01 2.817 0.004844 **
CAUSE_CODE4 -8.593e-02 1.808e-01 -0.475 0.634692
CAUSE_CODE5 5.543e-02 1.927e-01 0.288 0.773622
CAUSE_CODE6 5.294e-02 1.777e-01 0.298 0.765724
CAUSE_CODE7 -3.656e-01 2.201e-01 -1.661 0.096714 .
CAUSE_CODE8 -3.122e-01 1.874e-01 -1.666 0.095691 .
CAUSE_CODE9 9.558e-02 2.044e-01 0.468 0.639972
CAUSE_CODE10 1.818e-01 2.634e-01 0.690 0.490167
CAUSE_CODE11 -1.198e+00 3.951e-01 -3.031 0.002436 **
CAUSE_CODE12 -1.632e+00 4.607e-01 -3.542 0.000397 ***
CAUSE_CODE13 2.235e-01 1.162e-01 1.923 0.054482 .
CAUSE_CODE14 -4.895e-01 2.653e-01 -1.845 0.064979 .
CAUSE_CODE15 -2.877e-01 1.362e-01 -2.113 0.034595 *
CAUSE_CODE16 7.487e-01 1.373e-01 5.451 5.01e-08 ***
FIRST_IGN10 -6.033e-01 3.100e-01 -1.946 0.051673 .
FIRST_IGN12 -1.639e+00 4.875e-01 -3.362 0.000774 ***
FIRST_IGN15 -6.184e-01 2.788e-01 -2.218 0.026529 *
FIRST_IGN17 -5.808e-01 2.431e-01 -2.389 0.016911 *
FIRST_IGN18 -1.280e+01 1.068e+02 -0.120 0.904587
FIRST_IGN21 7.630e-01 2.049e-01 3.724 0.000196 ***
FIRST_IGN76 -5.524e-01 2.513e-01 -2.198 0.027916 *
FIRST_IGN81 -2.210e-01 2.618e-01 -0.844 0.398660
FIRST_IGNOther 7.508e-02 1.881e-01 0.399 0.689780
FIRST_IGNUU 2.367e-01 1.887e-01 1.254 0.209663
AREA_ORIG21 -5.657e-01 8.059e-02 -7.019 2.24e-12 ***
AREA_ORIG24 -7.024e-01 9.924e-02 -7.078 1.46e-12 ***
AREA_ORIG26 -1.923e+00 2.536e-01 -7.584 3.36e-14 ***
AREA_ORIG47 -2.114e+00 1.996e-01 -10.593 < 2e-16 ***
AREA_ORIG72 -1.795e+00 2.292e-01 -7.831 4.83e-15 ***
AREA_ORIG74 -2.271e+00 2.604e-01 -8.722 < 2e-16 ***
AREA_ORIG75 -1.454e+00 2.562e-01 -5.674 1.39e-08 ***
AREA_ORIG76 -2.450e+00 4.177e-01 -5.866 4.46e-09 ***
AREA_ORIGOther -9.926e-01 7.631e-02 -13.008 < 2e-16 ***
AREA_ORIGUU -1.067e+00 8.522e-02 -12.526 < 2e-16 ***
HEAT_SOURC10 -4.244e-01 1.972e-01 -2.152 0.031368 *
HEAT_SOURC11 -3.284e-01 2.533e-01 -1.296 0.194851
HEAT_SOURC12 -1.106e-01 1.834e-01 -0.603 0.546424
HEAT_SOURC13 -2.146e-01 2.053e-01 -1.045 0.295942
HEAT_SOURC40 -5.954e-01 2.675e-01 -2.226 0.026036 *
HEAT_SOURC43 -3.533e-01 2.753e-01 -1.283 0.199414
HEAT_SOURC60 4.204e-02 2.375e-01 0.177 0.859472
HEAT_SOURC61 -2.616e-02 3.182e-01 -0.082 0.934494
HEAT_SOURCOther -2.552e-01 1.827e-01 -1.397 0.162513
HEAT_SOURCUU -4.886e-02 1.550e-01 -0.315 0.752669
INC_TYPE111 1.325e+01 1.007e+03 0.013 0.989507
INC_TYPE112 1.268e+01 1.007e+03 0.013 0.989956
INC_TYPE120 1.333e+01 1.007e+03 0.013 0.989436
INC_TYPE121 1.305e+01 1.007e+03 0.013 0.989662
INC_TYPE122 1.331e+01 1.007e+03 0.013 0.989459
INC_TYPE123 -9.385e-01 1.375e+03 -0.001 0.999456
HUM_FAC_11 1.343e+01 4.419e+03 0.003 0.997575
HUM_FAC_12 1.338e+01 4.419e+03 0.003 0.997585
HUM_FAC_13 1.181e+01 4.419e+03 0.003 0.997867
HUM_FAC_14 1.365e+01 4.419e+03 0.003 0.997536
HUM_FAC_15 1.528e+01 4.419e+03 0.003 0.997241
HUM_FAC_16 1.271e+01 4.419e+03 0.003 0.997706
HUM_FAC_17 1.292e+01 4.419e+03 0.003 0.997667
HUM_FAC_1N 1.224e+01 4.419e+03 0.003 0.997790
ALARMS_YNO -1.552e-01 7.111e-02 -2.182 0.029104 *
ALARMS_YNY 3.230e-03 6.400e-02 0.050 0.959746
losscatMed_Loss 1.281e+00 1.012e-01 12.660 < 2e-16 ***
losscatMajor_Loss 1.910e+00 1.032e-01 18.500 < 2e-16 ***
losscatTotal_Loss 2.197e+00 1.003e-01 21.904 < 2e-16 ***
daycatEvening 2.340e-01 9.753e-02 2.400 0.016406 *
daycatMid-Day 3.360e-01 1.104e-01 3.044 0.002334 **
daycatMorning 7.029e-01 8.020e-02 8.764 < 2e-16 ***
daycatNight 6.102e-01 7.431e-02 8.211 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 20315 on 154177 degrees of freedom
Residual deviance: 16919 on 154091 degrees of freedom
AIC: 17093
Number of Fisher Scoring iterations: 17
I would be very careful with how you use logistic regression. Throwing "the kitchen sink" into a model will usually give some abnormal results. Fitting a regression model is not about throwing all variables in and seeing what sticks. Instead, start by thinking about which variables are likely to matter, work only with those that provide relevant information, and then use some step-wise method to find the important covariates. This, in itself, may solve the problem you mentioned about coefficient direction.
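As a rough illustration of the step-wise idea, here is a minimal sketch using base R's step() for backward selection by AIC. The data frame toy and its columns x1, x2, noise and y are made up for the example and are not the fire data; whether AIC-based selection suits your problem is a judgment call.

```r
# Simulated toy data (hypothetical, not the fire dataset):
# only x1 truly affects the outcome.
set.seed(42)
n <- 2000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-1 + 0.8 * toy$x1))

# Start from the "kitchen sink" model ...
full <- glm(y ~ x1 + x2 + noise, family = binomial, data = toy)

# ... and let backward selection drop terms that do not lower the AIC.
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)   # the terms that survived selection
AIC(full, reduced) # compare the two fits
```

On the real data you would replace toy with train_val and start from a shortlist of covariates you believe are relevant, rather than from every column.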
For factor variables, you can always recode them so that you are only working with the significant levels. For example, instead of month1-month12, you could have month4-month9 and "other". Separate coefficients for every month are not necessary if not all months are significant.
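One way to do this kind of recoding in base R is sketched below. The vector month here is a small made-up stand-in for train_val$month, and the choice of months 4 to 9 simply mirrors the example above.

```r
# Toy stand-in for train_val$month (hypothetical values)
month <- factor(c(1, 4, 7, 9, 12, 5), levels = 1:12)

# Keep months 4-9 as their own levels; collapse everything else into "other"
keep <- as.character(4:9)
month_grp <- as.character(month)
month_grp[!(month_grp %in% keep)] <- "other"

# Put "other" first so it becomes the reference level in glm()
month_grp <- factor(month_grp, levels = c("other", keep))

table(month_grp)
```

The recoded factor can then be used in the model formula in place of month.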
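In terms of interactions, yes, of course you can specify them, e.g. with month:cause. A model with only main effects applies the same month effect to every cause, which is why your appliance predictions follow the overall seasonal trend. Use interactions with caution, though: you should only add an interaction if it makes sense to.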
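To make the difference concrete, here is a small sketch on simulated data, not the fire dataset. The variables season and cause and the effect sizes are invented for illustration: winter raises the risk for cooking fires only. The main-effects model is forced to give both causes the same seasonal shift, while the model with the interaction can fit them separately.

```r
# Simulated toy data (hypothetical): winter raises risk for cooking fires only
set.seed(1)
n <- 5000
toy <- data.frame(
  season = factor(sample(c("summer", "winter"), n, replace = TRUE)),
  cause  = factor(sample(c("appliance", "cooking"), n, replace = TRUE))
)
lp <- -2 + 1.0 * (toy$season == "winter" & toy$cause == "cooking")
toy$Fatality <- rbinom(n, 1, plogis(lp))

# Main effects only vs. main effects plus interaction
# (season * cause expands to season + cause + season:cause)
main_only <- glm(Fatality ~ season + cause, family = binomial, data = toy)
with_int  <- glm(Fatality ~ season * cause, family = binomial, data = toy)

# Likelihood-ratio test: does the interaction improve the fit?
anova(main_only, with_int, test = "Chisq")

# Predicted probabilities for every season/cause combination
grid <- expand.grid(season = levels(toy$season), cause = levels(toy$cause))
grid$p_main <- predict(main_only, newdata = grid, type = "response")
grid$p_int  <- predict(with_int,  newdata = grid, type = "response")
grid
```

With 12 months and 16 cause codes the full interaction adds many coefficients for a rare outcome, so it may be worth combining it with the level recoding described earlier.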
I would not recommend using conditional models, because they will reduce your degrees of freedom significantly. Adding interactions can achieve the same effect as conditional models, but in a single model.
I would really only use an ensemble model if you know your individual models are valid. Averaging two poor models will not provide better results.
I hope this helps!