
Introduce country fixed effects to glm() and set "reference country"


I need to introduce fixed effects (in this case: country dummies) into an otherwise simple glm() in R.

The country fixed effects variables in my data look like this:

country   country_a   country_b   country_c   y   x   ...
1         1           0           0
1         1           0           0
2         0           1           1
2         0           1           1

Would this be the correct way to implement it?

    glm(y ~ x + country_a + country_b + country_c, family=binomial(link="logit"))

And if so, how would I set a specific country as the reference category? I know that I need to drop one country dummy, because otherwise I would have perfect collinearity, and normally the dropped country would then be my reference country. But what if other countries "go NA" as well, simply because they appear only a few times in the data and therefore disappear from the analysis (listwise deletion)? Will country_a still be my reference category if I decide to drop it?

Or do I have to use the country variable (left column) in the first place and somehow tell glm() that it is an unordered factor? If so, how would I do that?


Solution

  • With data like:

    > d
      country         y         x
    1       1 0.9610213 0.2586365
    2       1 0.8561303 0.5972043
    3       2 0.5463802 0.6412527
    4       2 0.4703876 0.1126319
    

    You can either convert country to a factor in the glm call:

    > glm(y~factor(country),data=d)
    
    Call:  glm(formula = y ~ factor(country), data = d)
    
    Coefficients:
         (Intercept)  factor(country)2  
              0.9086           -0.4002  
    
    Degrees of Freedom: 3 Total (i.e. Null);  2 Residual
    Null Deviance:      0.1685 
    Residual Deviance: 0.008388     AIC: -7.317
    

    Or make a new column that makes it explicit that it's not numeric:

    > d$CountryCode = paste0("Country",d$country)
    > d
      country         y         x CountryCode
    1       1 0.9610213 0.2586365    Country1
    2       1 0.8561303 0.5972043    Country1
    3       2 0.5463802 0.6412527    Country2
    4       2 0.4703876 0.1126319    Country2
    > glm(y~CountryCode,data=d)
    
    Call:  glm(formula = y ~ CountryCode, data = d)
    
    Coefficients:
            (Intercept)  CountryCodeCountry2  
                 0.9086              -0.4002  
    
    Degrees of Freedom: 3 Total (i.e. Null);  2 Residual
    Null Deviance:      0.1685 
    Residual Deviance: 0.008388     AIC: -7.317
    

    The factor level missing from the coefficient table is the baseline (reference) level, in this case Country1.
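To pick a different reference country, one option is relevel() from base R, which moves a chosen level to the front of the factor so that it becomes the baseline. The sketch below is only an illustration: the data frame is simulated (hypothetical values, three countries), and it uses the binomial logit family from the question rather than the default Gaussian family shown above. The last part touches on the listwise-deletion concern: glm() drops factor levels that have no complete rows left, so the chosen baseline should stay in place as long as that country keeps at least one complete observation.

```r
# Hypothetical toy data: three countries, a binary outcome y, one covariate x.
set.seed(1)
d <- data.frame(
  country = rep(c(1, 2, 3), each = 20),
  x       = rnorm(60)
)
d$y <- rbinom(60, size = 1, prob = plogis(0.5 * d$x))

# Make country an unordered factor; by default the first level ("1") is the baseline.
d$country <- factor(d$country)

# Choose country 2 as the reference category instead.
d$country <- relevel(d$country, ref = "2")

# Logit model with country fixed effects; glm() builds the dummies itself.
fit <- glm(y ~ x + country, family = binomial(link = "logit"), data = d)

# Coefficients are reported for countries 1 and 3; country 2 is the omitted baseline.
names(coef(fit))

# Simulate a country being lost to missing values: all rows of country 3 get NA in x.
d$x[d$country == "3"] <- NA
fit_na <- glm(y ~ x + country, family = binomial(link = "logit"), data = d)

# Country 3 no longer appears; country 2 is still the baseline.
names(coef(fit_na))
```

The same idea works without modifying the data frame, for example by writing relevel(factor(country), ref = "2") directly in the formula. If the reference country itself loses all of its complete rows, its level is dropped too and the next remaining level becomes the baseline, so it is worth checking the coefficient names after fitting.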