Search code examples
rpredictionimputationr-micemarginal-effects

Obtaining predictions from a pooled imputation model


I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps that I already developed, using a fictive example from pima data from faraway package. Step 4 is where my issue occurs.

#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
  1. Apply a multiple imputation by chained equation method using MICE package. For the sake of the example, I previously randomly assign missing values to pima dataset using the ampute function from the same package. A number of 20 imputated datasets were generated by setting "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp

#-------------------multiple imputation by chained equation--------#
  #generate 20 imputated datasets
newresult<-mice(result,m=20)
  1. Run a logistic regression on each of the 20 imputated datasets. Inspecting convergence, original and imputated data distributions is skipped for the sake of the example. "Test" variable is set as the binary dependent variable.
#run a logistic regression on each of the 20 imputated datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
  1. Combine the regression estimations from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
  1. Generate predictions from the pooled imputation model using prediction function from the margins package. This specific function allows to generate predicted values fixed at a specific level (for factors) or values (for continuous variables). In this example, I could chose to generate new predicted probabilites, i.e. P(Y=1), while setting pregnant variable (# of pregnancies) at 3. In other words, it would give me the distribution of the issue in the contra-factual situation where all the observations are set at 3 for this variable. Normally, I would just give my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model with MICE, the object class is a mipo and not a glm object.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))

This throws the following error:

Error in check_at_names(names(data), at) : 
  Unrecognized variable name in 'at': (1) <empty>p<empty>r<empty>e<empty>g<empty>n<empty>a<empty>n<empty>t<empty

I thought of two solutions:

a) changing the class object to make it fit prediction()'s requirements

b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements

However, I'm not sure how to achieve this and would enjoy any advice that could help me getting closer to obtaining predictions from a pooled imputation model in R.


Solution

  • You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)

    In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:

    library(emmeans)
    emmeans(model, ~pregnant, at=list(pregnant=3))
    

    marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)

    library(marginaleffects)
    fit_reg <- function(dat) {
        mod <- glm(test~pregnant+glucose+diastolic+
                   triceps+age+bmi, 
                   data = dat, family = binomial)
        out <- predictions(mod, newdata = datagrid(pregnant=3))
        return(out)
    }
    dat_mice <- mice(pima, m = 20, printFlag = FALSE, .Random.seed = 1024)
    dat_mice <- complete(dat_mice, "all")
    mod_imputation <- lapply(dat_mice, fit_reg)
    mod_imputation <- pool(mod_imputation)