Search code examples
rstatisticsr-micemlogit

Multiple imputation and mlogit for a multinomial regression


I am trying to run a multinomial regression with imputed data. I can do this with the nnet package, however I want to use mlogit. Using the mlogit package I keep getting the following error "Error in 1:nrow(data) : argument of length 0".

So making the data

library(mlogit)
library(nnet)
library(tidyverse)
library(mice)

df <- data.frame(vax = sample(1:6, 500, replace = T),
                 age = runif(500, 12, 18),
                 var1 = sample(1:2, 500, replace = T),
                 var2 = sample(1:5, 500, replace = T))

# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp

df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)

# Inpute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)

It works using nnet's multinom:

multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))

But throws up an error when I try to reshape the data into mlogit format

test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))

Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?


Solution

  • Answer

    You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:

    analyses <- lapply(seq_len(df4$m), function(i) {
      data.i <- complete(df4, i)
      data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
      mlogit(vax ~ 1 | age + var1 + var2, 
             data = data.idx, 
             reflevel = "1", 
             nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
    })
    test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
    oldClass(test) <- c("mira", "matrix")
    summary(pool(test))
    

    How with.mids works

    When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.

    If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:

    1. It loops over all imputed datasets.
    2. It uses complete to get one dataset.
    3. It runs the expression with the dataset as the environment.

    The third step is the problem. For functions that use formulas (like lm, glm and multinom), you can use that formula within a given environment. If the variables are not in the current environment (but rather in e.g. a data frame), you can specify a new environment by setting the data variable.


    The problems

    This is where both your problems derive from:

    • In your multinom call, you set the data variable to be df. Hence, you are actually running your multinom on the original df, NOT the imputed dataset!

    • In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only the environment. That isn't sufficient for you.


    Fixing multinom

    The solution for your multinom line is simple: just don't specify data:

    multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
    summary(pool(multinomtest))
    

    As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.


    Fixing dfidx (and mlogit)

    We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:

    analyses <- lapply(seq_len(df4$m), function(i) {
      data.i <- complete(df4, i)
      data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
      mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1", nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
    })
    

    From there, all we have to do is make something that looks like a mira object, so that we can still use pool:

    test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
    oldClass(test) <- c("mira", "matrix")
    summary(pool(test))