I have been using stepAIC
in the MASS
package to perform model selection. I would like two independent variables in the model always to be added or removed together in the stepwise selection (both forward and backward directions).
Here's my code:
full.model <- glm(formula=Hospital ~ A + B + C + D1 + D2, family="binomial", data=data)
MASS::stepAIC(full.model, scope = list(upper = full.model, lower = Hospital ~ A),
direction = "both", trace = 1)
Ideally, I would like D1
& D2
(both are numeric variables) always be added or removed together, but it doesn't seem to me that stepAIC has this flexibility. Can anyone shade light on how this can be achieved either using stepAIC
or another R function? Many thanks!
I dont know if there is a parameter that can accommodate this functionality in MASS::stepAIC
, but you can do this yourself somewhat manually.
Say you have these data:
set.seed(123)
n <- 1e3
data <- data.frame(hospital = sample(0:1, n, replace = TRUE),
A = as.factor(sample(LETTERS, n, replace = TRUE)),
B = as.factor(sample(c("Apples", "Oranges", "Grapes", "Bananas"), n, replace = TRUE)),
C = as.factor(sample(c("planes","trains", "automobiles"), n, replace = TRUE)),
D1 = runif(n),
D2 = runif(n))
You can generate a list of all the combinations of predictor variables that include either both or none of D1
and D2
using:
# get all possible combinations
all_list <- do.call(c, lapply(seq_along(names(data)[-1]), combn, x = names(data[-1]), simplify = FALSE))
# keep only those with either both or no `D1` and `D2`
keeps <- lapply(all_list, \(x) sum((x %in% c("D1", "D2"))) != 1)
keeps_list <- all_list[unlist(keeps)]
If you run print(keeps_list)
you will see that it contains the column names for all of the permutations with the D1
/D2
constraint (not doing that here for space considerations).
Once you have this list, you can loop through the list using lapply
to paste them into a formula, run the models, and output the AIC:
formula_list <- lapply(keeps_list, \(x) paste0("hospital ~ ", paste(x, collapse = " + ")))
# Contains all model information
full_models <- lapply(formula_list,\(x) glm(x, family = "binomial", data = data))
Where full_model
contains all the model information from glm()
for all the various combinations.
If you want to get the AIC information for model comparison:
# extracts and organizes AIC for comparison
aic_models <- do.call(rbind, lapply(full_models, \(x) summary(x)$aic))
aic_models <- data.frame(aic_models)
rownames(aic_models) <- lapply(keeps_list, paste, collapse = "-")
# aic_models
# A 1403.604
# B 1392.457
# C 1390.975
# A-B 1408.608
# A-C 1407.239
# B-C 1395.990
# D1-D2 1389.446
# A-B-C 1412.184
# A-D1-D2 1405.016
# B-D1-D2 1394.588
# C-D1-D2 1392.992
# A-B-D1-D2 1410.132
# A-C-D1-D2 1408.626
# B-C-D1-D2 1398.091
# A-B-C-D1-D2 1413.687
Then if desired, you can compare them against the full model (diff
) and identify the parameter combination with the lowest AIC (best
):
aic_models[,"diff"] <- aic_models[,1] - aic_models[nrow(aic_models),1]
aic_models[,"best"] <- aic_models$aic_models == min(aic_models$aic_models)
# aic_models diff best
# A 1419.566 -6.5183258 FALSE
# B 1392.836 -33.2485780 FALSE
# C 1388.043 -38.0417237 TRUE
# A-B 1424.729 -1.3556720 FALSE
# A-C 1418.896 -7.1890668 FALSE
# B-C 1392.839 -33.2454109 FALSE
# D1-D2 1390.085 -35.9994857 FALSE
# A-B-C 1424.130 -1.9541680 FALSE
# A-D1-D2 1421.337 -4.7478673 FALSE
# B-D1-D2 1394.965 -31.1196409 FALSE
# C-D1-D2 1389.972 -36.1126197 FALSE
# A-B-D1-D2 1426.671 0.5868186 FALSE
# A-C-D1-D2 1420.708 -5.3769971 FALSE
# B-C-D1-D2 1394.934 -31.1503318 FALSE
# A-B-C-D1-D2 1426.085 0.0000000 FALSE
I know its a bit of work, but this should get you started!