Search code examples
rpanel-dataemmeansmarginal-effects

How to run the predicted probabilities (or average marginal effects) for individuals fixed effects in panel data using R?


These are three different ways to run an individual fixed effect method which gives more or less the same results (see below). My main question is how to get predictive probabilities or average marginal effects using the second model (model_plm) or the third model(model_felm). I know how to do it using the first model (model_lm) and show an example below using ggeffects, but that only works when i have a small sample.

As i have over a million individual, my model only works using model_plm and model_felm. If i use model_lm, it takes a lot of time to run with one million individuals since they are controlled for in the model. I also get the following error: Error: vector memory exhausted (limit reached?). I checked many threads on StackOverflow to work around that error but nothing seems to solve it.

I was wondering whether there is an efficient way to work around this issue. My main interest is to extract the predicted probabilities of the interaction residence*union. I usually extract predictive probabilities or average marginal effects using one of these packages: ggeffects,emmeans or margins.

library(lfe)
library(plm)
library(ggeffects)
data("Males")  

model_lm = lm(wage ~ exper + residence+health + residence*union +factor(nr)-1, data=Males)
model_plm = plm(wage ~ exper + residence + health + residence*union,model = "within", index=c("nr", "year"), data=Males)
model_felm = felm(wage ~ exper + residence + health + residence*union | nr, data= Males)

pred_ggeffects <- ggpredict(model_lm, c("residence","union"), 
                    vcov.fun = "vcovCL", 
                    vcov.type = "HC1",
                    vcov.args = list(cluster = Males$nr))

Solution

  • I tried adjusting formula/datasets to get emmeans and plm to play nice. Let me know if there's something here. I realized the biglm answer wasn't going to cut it for a million individuals after some testing.

    library(emmeans)
    library(plm)
    data("Males")  
    
    ## this runs but we need to get an equivalent result with expanded formula
    ## and expanded dataset
    model_plm = plm(wage ~ exper + residence + health + residence*union,model = "within", index=c("nr"), data=Males)
    
    ## expanded dataset
    Males2 <- data.frame(wage=Males[complete.cases(Males),"wage"],
                         model.matrix(wage ~ exper + residence + health + residence*union, Males), 
                         nr=Males[complete.cases(Males),"nr"])
    
    
    (fmla2 <- as.formula(paste("wage ~ ", paste(names(coef(model_plm)), collapse= "+"))))
    
    ## expanded formula
    model_plm2 <- plm(fmla2,
                      model = "within",
                      index=c("nr"), 
                      data=Males2)
    
    (fmla2_rg <- as.formula(paste("wage ~ -1 +", paste(names(coef(model_plm)), collapse= "+"))))
    
    plm2_rg <- qdrg(fmla2_rg,
                    data = Males2,
                    coef = coef(model_plm2),
                    vcov = vcov(model_plm2),
                    df = model_plm2$df.residual)
    
    plm2_rg
    
    ### when all 3 residences are 0, that's `rural area`
    ### then just pick the rows when one of the residences are 1
    emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
    

    Which gives, after some row-deletion:

    > ### when all 3 residences are 0, that's `rural area`
    > ### then just pick the rows when one of the residences are 1
    > emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
     residencenorth_east residencenothern_central residencesouth unionyes emmean     SE   df lower.CL upper.CL
                       0                        0              0        0 0.3777 0.0335 2677  0.31201    0.443
                       1                        0              0        0 0.3301 0.1636 2677  0.00929    0.651
                       0                        1              0        0 0.1924 0.1483 2677 -0.09834    0.483
                       0                        0              1        0 0.2596 0.1514 2677 -0.03732    0.557
                       0                        0              0        1 0.2875 0.1473 2677 -0.00144    0.576
                       1                        0              0        1 0.3845 0.1647 2677  0.06155    0.708
                       0                        1              0        1 0.3326 0.1539 2677  0.03091    0.634
                       0                        0              1        1 0.3411 0.1534 2677  0.04024    0.642
    
    Results are averaged over the levels of: healthyes 
    Confidence level used: 0.95