
Fixed Effects model where the response variable is bounded between 0 and 1


I ran a fixed-effects model on the following data set:

> data
# A tibble: 13,646 × 7
# Groups:   age [16]
   account_id  time default_rate r12_gdp_bl r12_gdp_st   age lifecycle
        <dbl> <dbl>        <dbl>      <dbl>      <dbl> <dbl>     <dbl>
 1     400293  2005      NA           0.848      0.848    17   0.00238
 2     400293  2006      NA           3.81       3.81     16 NaN      
 3     400293  2007      NA           3.34       3.34     15   0.00694
 4     400293  2008       0.0058      0.806      0.806    14   0.00897
 5     400293  2009       0.0165     -5.22      -5.22     13   0.0325 
 6     400293  2010       0.0001      3.79       3.79     12   0.0115 
 7     400293  2011       0.0165      3.34       3.34     11   0.0148 
 8     400293  2012       0.0136      0.892      0.892    10   0.0126 
 9     400293  2013       0.0201      0.531      0.531     9   0.0144 
10     400293  2014      NA           1.74      -0.867     8 NaN      
# … with 13,636 more rows

Here default_rate (the credit default rate) is the response variable, and age and GDP are the fixed effects. The model is the following:

> library(fixest)
> mod_bl <- feols(default_rate ~ account_id | age^r12_gdp_bl, data = data)
NOTE: 11,274 observations removed because of NA values (LHS: 11,274).
> summary(mod_bl)
OLS estimation, Dep. Var.: default_rate
Observations: 2,372 
Fixed-effects: age^r12_gdp_bl: 8
Standard-errors: Clustered (age^r12_gdp_bl) 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.033706     Adj. R2: 0.031225
                 Within R2: 0.002527

The model seems meaningless, but that's what I was asked to do. The original default_rate is a probability, so it is bounded between 0 and 1.

> summary(data$default_rate)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.002   0.006   0.015   0.015   0.510   11274 

However, the model's fitted values are not bounded between 0 and 1, as they are supposed to be.

> summary(mod_bl$fitted.values)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.04275  0.01210  0.01375  0.01497  0.01502  0.03273 

How can I fix the model in order to obtain responses bounded between 0 and 1?


Solution

  • The most natural (IMO) thing to do would be to fit a logistic regression instead of an OLS model. You can do this with fixest::feglm. Since your data are not binary (0/1) values, this is actually a fractional logistic regression model. The example below uses binary data because that's what I have handy, but you should be able to modify your example by substituting feglm for feols and adding family = "quasibinomial".

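    library(fixest)
    data(Contraception, package = "mlmRev")
    ## 'use' is initially a factor ("N", "Y"), convert to 0/1
    cc <- transform(Contraception, nUse = as.numeric(use) - 1)
    ## logistic fit with 'district' as the fixed effect
    m1 <- feglm(nUse ~ livch + age + urban | district, data = cc,
                family = "quasibinomial")
    
    ## predict() returns values on the response (probability) scale by default
    summary(predict(m1))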
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.02414 0.25616 0.38540 0.39696 0.52402 0.82794 
    

    If your supervisor insists on an OLS fit, they are asking for a linear probability model and should be willing to live with predictions outside of [0,1].
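
To illustrate the contrast between the two fits on a fractional (non-binary) response, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the variables y, x and g are hypothetical, the rates are drawn from a beta distribution, and base R's lm() and glm() with a factor for the group are swapped in for feols() and feglm() so that the sketch runs without extra packages. With the same formula and family, feglm() should give the same point estimates, but that is not checked here.

```r
set.seed(1)
n <- 2000

# Hypothetical data: 8 groups (playing the role of the fixed effect)
# and one continuous covariate
g <- sample(1:8, n, replace = TRUE)
x <- rnorm(n)

# True mean rate on the logit scale, so it always lies in (0, 1)
mu <- plogis(-4 + 0.8 * x + 0.2 * g)

# Simulated fractional response: beta-distributed rates with mean mu
y <- rbeta(n, mu * 50, (1 - mu) * 50)
d <- data.frame(y = y, x = x, g = factor(g))

# Linear probability model: fitted values are not constrained
ols <- lm(y ~ x + g, data = d)

# Fractional logistic regression: logit link plus quasibinomial family
frac <- glm(y ~ x + g, data = d, family = quasibinomial())

# Compare the ranges of the fitted values on the response scale
print(range(fitted(ols)))   # may fall outside [0, 1]
print(range(fitted(frac)))  # inverse-logit of a linear predictor, so within [0, 1]
```

The quasibinomial family is used rather than binomial because the response is a rate rather than a 0/1 outcome or a count of successes; it avoids the non-integer warning that binomial would give and estimates a dispersion parameter for the standard errors.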