Search code examples
rdatabaseregressioncluster-analysislogistic-regression

Error with logitmfx in R to calculate robust cluster standard error


I am hoping to run a logit regression which predicts the marginal effect at the mean of family size and age, and the effect of binary indicators (whether an individual is an immigrant, has health insurance, or smokes) on the predicted probability of developing hypertension.

This data comes from a clustered survey, and I am hoping to include robust clustered standard errors in the output.

But when I add the code to include robust cluster SE, I receive an error that the variables in my regression are not longer found and I'm not sure why. Any advice would be great! Thanks.

AGE       IMMIGRANT     FAMSIZE     HLTH_INS    HYPERTEN   SMOKE    PSU
<int>       <dbl>         <int>       <dbl>       <dbl>     <dbl>  <int>
40           0              2          1            0         0      2
23           0              2          1            0         0      1
24           0              2          1            0         0      2
18           0              3          1            1         0      2
30           0              2          1            0         0      2
33           1              6          0            0         0      1

#or if this is an easier output to reproduce:
structure(list(AGE = c(40L, 23L, 24L, 18L, 30L, 33L, 32L, 63L, 
22L, 24L), IMMIGRANT = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 1), FAMSIZE = c(2L, 
2L, 2L, 3L, 2L, 6L, 2L, 1L, 2L, 1L), HLTH_INS = c(1, 1, 1, 1, 
1, 0, 1, 1, 1, 0), HYPERTEN = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0), 
    SMOKE = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1), PSU = c(2L, 1L, 
    2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L)), row.names = c(NA, -10L), class = "data.frame")


#The regression works without adjusting for clustered SE 
logit<-logitmfx(HYPERTEN~scale(AGE)+IMMIGRANT+scale(FAMSIZE)+HLTH_INS+
                 SMOKE,data=sample,
                atmean=TRUE,robust=T)


#However, when I add in the code to cluster SE I receive the error: "Error in scale(AGE) : object 'AGE' not found" 
logit<-logitmfx(HYPERTEN~scale(AGE)+IMMIGRANT+scale(FAMSIZE)+HLTH_INS+
                 SMOKE,data=sample,
                atmean=TRUE,robust=T,clustervar1="PSU", clustervar2=NULL,!is.null("PSU")) 

Solution

  • Trying to replicate the steps of the function using the source code, Steffen Moritz' solution should indeed work. The problem arises, since first logitmfx instantly calls another function logitmfxest.

    This function has the same arguments, but it also has the following code:

    if(!is.null(clustervar1)){
        if(is.null(clustervar2)){
          if(!(clustervar1 %in% names(data))){
            stop("clustervar1 not in data.frame object")
          }    
          data = data.frame(model.frame(formula, data, na.action=NULL),data[,clustervar1])
          names(data)[dim(data)[2]] = clustervar1
          data=na.omit(data)
        }
        if(!is.null(clustervar2)){
          if(!(clustervar1 %in% names(data))){
            stop("clustervar1 not in data.frame object")
          }    
          if(!(clustervar2 %in% names(data))){
            stop("clustervar2 not in data.frame object")
          }    
          data = data.frame(model.frame(formula, data, na.action=NULL),
                            data[,c(clustervar1,clustervar2)])
          names(data)[c(dim(data)[2]-1):dim(data)[2]] = c(clustervar1,clustervar2)
          data=na.omit(data)
        }
      }
    

    From this the following code gets activated in your case:

    if(!is.null(clustervar1)){
        if(is.null(clustervar2)){  
          data = data.frame(model.frame(formula, data, na.action=NULL),data[,clustervar1])
          names(data)[dim(data)[2]] = clustervar1
          data=na.omit(data)
        }
      }
    

    This redefines "data" to be a data.frame build on the model.frame. But the model frame uses names from you formula, so suddenly column 2 is called scale.AGE. and column 3 is called scale.FAMSIZE..

    This is a big problem since the function then calls a generalized linear model:

    fit = glm(formula, data=data, family = binomial(link = "logit"), x=T, 
            start = start, control = control) 
    

    where it uses your original formula containing scale(AGE) and scale(FAMSIZE), but with the new dataframe with the renamed columns.

    So scaling before inputting should work. And indeed any other function will, as Steffen mentioned, cause the same error, since they will produce a similar renaming of columns when model.frame is called.