
R mlogit model: missing value where TRUE/FALSE needed, 20 invalid factor level warnings


I'm trying to run a multinomial logistic regression using the mlogit package in R. I've uploaded the data here https://drive.google.com/file/d/0B_o3xTWAYdbuRGw0dzNFRzd2NEk/view?usp=sharing.

The data contains two different choice variables which I want to run the same model on. I run the first model like so:

lfsm1 <- mlogit.data(lfs.models, shape="wide", choice="PWK")
f1 <- mFormula(PWK~1 | MIGGRP+SEX+AGE+EDU)
m1 <- mlogit(f1, lfsm1, weights=PWT14)
summary(m1)

This model runs without issues. Then I run the same exact model on the other choice variable:

lfsm2 <- mlogit.data(lfs.models, shape="wide", choice="multi")
f2 <- mFormula(multi~1 | MIGGRP+SEX+AGE+EDU)
m2 <- mlogit(f1, lfsm2, weights=PWT14)

I get the following errors:

Error in if (is.null(initial.value) || lnl <= initial.value) break : 
missing value where TRUE/FALSE needed
In addition: There were 20 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In `[<-.factor`(`*tmp*`, is.na(x), value = FALSE) :
   invalid factor level, NA generated

And that warning message repeats 20x.

I'm not sure what either of these errors means in the context of my model. A previous post (mlogit: missing value where TRUE/FALSE needed) suggests that my first error occurs because my data are not in wide format, or because there are some individuals who do not select any of the alternatives. In my case neither of these explanations can be right. What I've seen about the warning messages suggests that mlogit is reacting badly to variables being factors or numeric. But I don't quite understand why this would matter in a multinomial regression context, or why the problem occurred only twenty times in such a large dataset.

Any suggestions would be most appreciated!


Solution

  • Try

    m2 <- mlogit(f2, lfsm2, weights=PWT14)
    

    Note the f2 in the call to mlogit.

    In your second call to mlogit.data, you specified multi as the choice variable, and the data are prepared accordingly. Yet in the formula you are using, f1, the dependent variable is PWK, so mlogit expects a data frame with one row for each alternative as defined by PWK, not multi.
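
A mismatch like this is easy to miss when a model is copied and edited, so a quick check before fitting may help. The following is a minimal sketch in base R only: `response_name` and `check_choice` are hypothetical helpers, not part of mlogit, and the plain formulas stand in for the `mFormula` objects from the question (the sketch has not been tested on `mFormula` objects themselves).

```r
# Hypothetical helper (not part of mlogit): return the name of the response
# (left-hand side) of a two-sided formula.
response_name <- function(f) {
  stopifnot(length(f) == 3L)  # a two-sided formula has `~`, lhs and rhs
  deparse(f[[2L]])
}

# Hypothetical helper: stop early if the formula's response differs from the
# choice variable that was passed to mlogit.data().
check_choice <- function(f, choice) {
  lhs <- response_name(f)
  if (!identical(lhs, choice)) {
    stop(sprintf("formula response '%s' does not match choice variable '%s'",
                 lhs, choice))
  }
  invisible(TRUE)
}

# Plain base R formulas standing in for f1 and f2 from the question.
f1 <- PWK ~ 1 | MIGGRP + SEX + AGE + EDU
f2 <- multi ~ 1 | MIGGRP + SEX + AGE + EDU

check_choice(f2, "multi")    # passes silently
# check_choice(f1, "multi")  # would stop with a mismatch error
```

Calling `check_choice(f2, "multi")` just before `mlogit(f2, lfsm2, weights=PWT14)` would flag the copy-and-paste slip from the question before the model is fitted.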