Search code examples
rfixestr-marginaleffects

How does marginaleffects compute groupwise predictions in the presence of fixed effects?


I want to compute predictions for a fixed effects ols regression by group with the marginaleffects package. When I generate new data for the prediction and do not specify the specific level of the fixed effects variables that I want the prediction to use, the prediction function calculates the values at am==0, gear==0 and carb==4 (see below).

Where do these values come from? What's the most sensible way to calculate average predictions in the presence of fixed effects in this example?

library(tidyverse)
library(fixest)
library(marginaleffects)

dat <- mtcars

m1 <- dat %>% 
  mutate(cyl = as_factor(cyl)) %>% 
  feols(mpg ~ hp*cyl | am + gear + carb)

etable(m1)
#>                               m1
#> Dependent Var.:              mpg
#>                                 
#> hp              -0.1316 (0.0504)
#> cyl6              -47.87 (27.42)
#> cyl8              -21.31 (14.80)
#> hp x cyl6        0.4434 (0.2619)
#> hp x cyl8        0.1471 (0.0665)
#> Fixed-Effects:  ----------------
#> am                           Yes
#> gear                         Yes
#> carb                         Yes
#> _______________ ________________
#> S.E.: Clustered           by: am
#> Observations                  32
#> R2                       0.88892
#> Within R2                0.38565
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

pred1 <- predictions(m1, 
                     newdata = datagrid(
                       hp = c(min(dat$hp):max(dat$hp)), 
                       cyl = levels(dat$cyl))) 
pred1
#> 
#>  Estimate Std. Error     z Pr(>|z|)   S  2.5 % 97.5 % cyl am gear carb  hp
#>      9.76       8.72 1.119    0.263 1.9  -7.33   26.9   8  0    3    4  52
#>      9.78       8.60 1.136    0.256 2.0  -7.09   26.6   8  0    3    4  53
#>      9.79       8.49 1.154    0.249 2.0  -6.84   26.4   8  0    3    4  54
#>      9.81       8.37 1.172    0.241 2.1  -6.60   26.2   8  0    3    4  55
#>      9.82       8.25 1.190    0.234 2.1  -6.35   26.0   8  0    3    4  56
#> --- 274 rows omitted. See ?avg_predictions and ?print.marginaleffects --- 
#>     14.09      23.89 0.590    0.555 0.8 -32.73   60.9   8  0    3    4 331
#>     14.11      24.01 0.588    0.557 0.8 -32.94   61.2   8  0    3    4 332
#>     14.12      24.12 0.586    0.558 0.8 -33.15   61.4   8  0    3    4 333
#>     14.14      24.24 0.583    0.560 0.8 -33.37   61.6   8  0    3    4 334
#>     14.16      24.36 0.581    0.561 0.8 -33.58   61.9   8  0    3    4 335
#> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, mpg, cyl, am, gear, carb, hp

Created on 2023-08-22 with reprex v2.0.2


Solution

  • You ask:

    When I generate new data for the prediction and do not specify the specific level of the fixed effects variables that I want the prediction to use, the prediction function calculates the values at am==0, gear==0 and carb==4

    (I think you meant gear=6. Must be a typo.)

    When a variable is not explicitly specified by name in datagrid, the value is picked by a function which depends on the variable class. Please read ?datagrid and check out the FUN_character, FUN_numeric, and other arguments to that function.

    In fixest models, the variables after the | are treated as factor variables, so the default is to take the mode. When there is more than one mode, the choice is arbitrary (I think it’s the first one to appear in the dataset, but not 100% sure). The user can of course customize this by specifying the relevant FUN_* argument.

    You ask:

    What’s the most sensible way to calculate average predictions in the presence of fixed effects in this example?

    In my opinion, it is typically most sensible to not specify the newdata argument at all when computing average predictions by group. When you leave, newdata=NULL, the function will compute predictions for each row of the original data, and then average them by group. This default behavior gives you predicted values averaged across the empirical distribution of the covariates.

    That said, this is a statistical question whose answer may vary depending on your specific research context and application. If I were you, I would consult a local statistician or open a new question on Cross Validated (not Stack Overflow) with a lot more detail about the specific research question that you hope to answer.

    library(fixest)
    library(marginaleffects)
    
    dat <- mtcars
    mod <- feols(mpg ~ hp*cyl | am + gear + carb, data = dat)
    
    avg_predictions(mod, by = c("am", "cyl"))
    # 
    #  am cyl Estimate Std. Error     z Pr(>|z|)   S   2.5 % 97.5 %
    #   0   4     23.8       12.1 1.973   0.0484 4.4   0.163   47.4
    #   0   6     18.8       14.2 1.321   0.1863 2.4  -9.088   46.7
    #   0   8     14.9       17.2 0.867   0.3858 1.4 -18.807   48.7
    #   1   4     27.8       11.8 2.360   0.0183 5.8   4.720   51.0
    #   1   6     20.4       15.4 1.327   0.1845 2.4  -9.746   50.6
    #   1   8     16.5       22.1 0.747   0.4550 1.1 -26.859   59.9
    # 
    # Columns: am, cyl, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high