Search code examples
rstataoffsetpredictpoisson

Why is predict not ignoring my offset from a Poisson model in R no matter how I enter the offset into the model?


I am working in R but have been validating my results in Stata and through doing so have observed that predict in R is not ignoring my offset from my Poisson model. Let me explain:

I have fitted the following model in R - to model excess mortality as opposed to simply mortality (ExpDeaths is the expected deaths given each subjects age, sex, and period based on the general population and logExpDeaths in the Stata code shown next is just the natural log of ExpDeaths):

model <- glm(Event ~ relevel( as.factor(Period), ref=2) + relevel( as.factor(AgeCat), ref="50-59") + relevel( as.factor(Sex), ref="Female") relevel( as.factor(AlcCombo), ref="0") + relevel( as.factor(ScoreSurv), ref="0") + relevel( as.factor(DrugCombo), ref="0"), offset = (log(ExpDeaths)), data=data, family = poisson)

and have validated the results in Stata using:

poisson Event ib1.Period ib1.Age i.Sex ib1.AlcCombo ib0.ScoreSurv ib0.DrugCombo, 
   offset(logExpDeaths)

The model results in R and Stata using the above lines of code are exactly the same.

However, when I then try to get the linear predictor for each subject from the model:

In R using the code predict(model, type="link") I get for my first five values: -3.812156 -2.472995 -2.499536 -2.299561 -2.217279

However, when I use the code predict lp, xb nooffset in Stata I get for my first five values: 0.6458265 0.8994361 0.8994361 0.8588267 1.338368

These are the values I want to produce in R, but I've realised the issue is that R is not ignoring the offset, as when I do predict lb, xb in Stata i.e. keeping the offset based on the expected deaths in, I get the same values as I did in R: -3.812156 -2.472995 -2.499536 -2.299561 -2.217279

The R documentation for glm (see https://www.math.ucla.edu/~anderson/rw1001/library/base/html/glm.html) states that "Offsets specified by offset will not be included in predictions by predict.glm, whereas those specified by an offset term in the formula will be" i.e. if I use the model as I did, the offset should be ignored:

model <- glm(Event ~ relevel( as.factor(Period), ref=2) + relevel( as.factor(AgeCat), ref="50-59") + relevel( as.factor(Sex), ref="Female") + relevel( as.factor(AlcCombo), ref="0") + relevel( as.factor(ScoreSurv), ref="0") + relevel( as.factor(DrugCombo), ref="0"), offset = (log(ExpDeaths)), data=data, family = poisson)

As opposed to using the below which would mean the offset wasn't ignored when using predict according to the documentation:

model <- glm(Event ~ relevel( as.factor(Period), ref=2) + relevel( as.factor(AgeCat), ref="50-59") + relevel( as.factor(Sex), ref="Female") + relevel( as.factor(AlcCombo), ref="0") + relevel( as.factor(ScoreSurv), ref="0") + relevel( as.factor(DrugCombo), ref="0") + offset(log(ExpDeaths)), data=data, family = poisson)

However, I get the exact same model (which I would expect) and linear predictors (which should be different) using both, which leads me to conclude that neither way of writing the model in R is leading to the offset being ignored when using predict.

I know I can just use Stata to get the desired results, but I really want to know how to get the Stata results using R just for my own sanity i.e. how to get predict to ignore the offset using R.


Solution

  • When you call nooffset you are simply subtracting the offset from the linear predictor.

    Stata

    use https://data.princeton.edu/wws509/datasets/ceb.dta,clear
    gen y=round(mean*n,1)
    gen os=log(n)
    poisson y i.res, offset(os)
    predict xb, xb
    predict lp, xb nooffset
    list in 1/6,clean
    
    
           i   dur     res            educ   mean    var    n    y         os         xb         lp  
      1.   1   0-4    Suva            None     .5   1.14    8    4   2.079442   3.284039   1.204598  
      2.   2   0-4    Suva   Lower primary   1.14    .73   21   24   3.044523    4.24912   1.204598  
      3.   3   0-4    Suva   Upper primary     .9    .67   42   38    3.73767   4.942267   1.204598  
      4.   4   0-4    Suva      Secondary+    .73    .48   51   37   3.931826   5.136423   1.204598  
      5.   5   0-4   Urban            None   1.17   1.06   12   14   2.484907   3.833794   1.348887  
      6.   6   0-4   Urban   Lower primary    .85   1.59   27   23   3.295837   4.644724   1.348887
    

    R

    Here, notice that I can replicate the stata call predict lp, xb nooffset, simply by subtracting os from xb (See ceb$lp=ceb$xb-ceb$os)

    library(foreign)
    ceb<- read.dta("http://data.princeton.edu/wws509/datasets/ceb.dta")
    ceb$y  <- round(ceb$mean*ceb$n, 0)
    ceb$os <- log(ceb$n)
    m1 = glm(y~res, offset=os,data=ceb,family="poisson")
    ceb$xb=predict(m1, type="link")
    ceb$lp=ceb$xb-ceb$os 
    head(ceb)
    
      i dur   res          educ mean  var  n  y       os       xb       lp
    1 1 0-4  Suva          None 0.50 1.14  8  4 2.079442 3.284039 1.204598
    2 2 0-4  Suva Lower primary 1.14 0.73 21 24 3.044522 4.249120 1.204598
    3 3 0-4  Suva Upper primary 0.90 0.67 42 38 3.737670 4.942267 1.204598
    4 4 0-4  Suva    Secondary+ 0.73 0.48 51 37 3.931826 5.136423 1.204598
    5 5 0-4 Urban          None 1.17 1.06 12 14 2.484907 3.833794 1.348887
    6 6 0-4 Urban Lower primary 0.85 1.59 27 23 3.295837 4.644724 1.348887