Stepwise regression error in R


I want to run a stepwise regression in R to choose the best-fitting model. My code is below:

library(MASS)  # provides stepAIC()

full.modelfixed <- glm(died_ed ~ age_1 + gender + race + insurance + injury + ais + blunt_pen +
                         comorbid + iss + min_dist + pop_dens_new + age_mdn + male_pct +
                         pop_wht_pct + pop_blk_pct + unemp_pct + pov_100x_npct +
                         urban_pct,
                       data = trauma, family = binomial(link = "logit"), na.action = na.exclude)
reduced.modelfixed <- stepAIC(full.modelfixed, direction = "backward")

This produces the following error message:

Error in stepAIC(full.modelfixed, direction = "backward") :   
number of rows in use has changed: remove missing values?

Almost every variable in the data has some missing values, so I cannot simply remove every row that has a missing value (e.g. with data = na.omit(data)).

Any idea on how to fix this?

Thanks!!


Solution

  • This probably belongs on a statistics forum (stats.stackexchange), but briefly, there are a few considerations.

    The main one is that two models can only be compared if they are fitted to the same dataset, i.e. exactly the same rows (in a stepwise search the candidate models are also nested within each other).

    For example:

    glm1 <- glm(Dependent ~ indep1 + indep2 + indep3, family = binomial, data = data)
    glm2 <- glm(Dependent ~ indep1 + indep2, family = binomial, data = data)
    

    Now imagine that some values of indep3 are missing, but none of indep1 or indep2. When we fit glm1, it runs on a smaller dataset: only the rows that have the dependent variable and all three independent variables (i.e. any row where indep3 is missing is excluded).

    When we fit glm2, the rows missing a value for indep3 are included, because those rows do contain Dependent, indep1 and indep2, which are the only variables in that model.

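    We can no longer directly compare the two models, as they are fitted on different datasets. This is what the stepAIC error is reporting: dropping a variable changed the number of rows in use.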
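
    The mismatch can be seen on simulated data. This is only a sketch: the data frame dat and its values are made up for illustration, with 40 of 200 values of indep3 set to missing. nobs() reports how many rows each fit actually used.

    ```r
    set.seed(1)
    n <- 200

    # Simulated data (hypothetical, for illustration only)
    dat <- data.frame(indep1 = rnorm(n), indep2 = rnorm(n), indep3 = rnorm(n))
    dat$Dependent <- rbinom(n, 1, plogis(0.5 * dat$indep1 - 0.5 * dat$indep2))

    # Make indep3 missing in the first 40 rows
    dat$indep3[1:40] <- NA

    # glm() drops incomplete rows by default (na.action = na.omit)
    glm1 <- glm(Dependent ~ indep1 + indep2 + indep3, family = binomial, data = dat)
    glm2 <- glm(Dependent ~ indep1 + indep2, family = binomial, data = dat)

    nobs(glm1)  # 160: rows with missing indep3 were dropped
    nobs(glm2)  # 200: indep3 is not in the model, so all rows are used
    ```

    Because the two fits use different rows, their AIC values are not comparable.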

    I think broadly you can either 1) Limit to data which is complete 2) If appropriate consider multiple imputation
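
    For option 1, a minimal sketch on simulated data rather than the trauma data (the names toy, toy_cc, y, x1, x2, x3 and unused are hypothetical). The idea is to keep only rows that are complete for the variables in the full model, so that rows are not lost to missing values in columns the model never uses, and then to fit on that fixed subset so every step of stepAIC sees the same rows.

    ```r
    library(MASS)  # provides stepAIC()

    set.seed(42)
    n <- 300

    # Simulated data (hypothetical); 'unused' is a column that is not in the model
    toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), unused = rnorm(n))
    toy$y <- rbinom(n, 1, plogis(0.8 * toy$x1))
    toy$x2[sample(n, 30)] <- NA
    toy$x3[sample(n, 30)] <- NA
    toy$unused[sample(n, 100)] <- NA

    full_formula <- y ~ x1 + x2 + x3

    # Keep rows that are complete for the model variables only
    model_vars <- all.vars(full_formula)
    toy_cc <- toy[complete.cases(toy[, model_vars]), model_vars]

    # Every candidate model is now fitted on the same rows
    full_fit <- glm(full_formula, data = toy_cc, family = binomial(link = "logit"))
    reduced_fit <- stepAIC(full_fit, direction = "backward", trace = FALSE)

    nobs(full_fit) == nobs(reduced_fit)  # TRUE
    ```

    Whether a complete-case analysis is acceptable depends on how much data is lost and why the values are missing; if too many rows are dropped, multiple imputation (option 2) may be the better route.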

    Hope that helps.