I am working in R but have been validating my results in Stata and through doing so have observed that predict
in R is not ignoring my offset from my Poisson model. Let me explain:
I have fitted the following model in R - to model excess mortality as opposed to simply mortality (ExpDeaths is the expected deaths given each subjects age, sex, and period based on the general population and logExpDeaths in the Stata code shown next is just the natural log of ExpDeaths):
model <- glm(Event ~ relevel( as.factor(Period), ref=2) + relevel( as.factor(AgeCat), ref="50-59") + relevel( as.factor(Sex), ref="Female") relevel( as.factor(AlcCombo), ref="0") + relevel( as.factor(ScoreSurv), ref="0") + relevel( as.factor(DrugCombo), ref="0"), offset = (log(ExpDeaths)), data=data, family = poisson)
and have validated the results in Stata using:
poisson Event ib1.Period ib1.Age i.Sex ib1.AlcCombo ib0.ScoreSurv ib0.DrugCombo,
offset(logExpDeaths)
The model results in R and Stata using the above lines of code are exactly the same.
However, when I then try to get the linear predictor for each subject from the model:
In R using the code predict(model, type="link")
I get for my first five values:
-3.812156
-2.472995
-2.499536
-2.299561
-2.217279
However, when I use the code predict lp, xb nooffset
in Stata I get for my first five values:
0.6458265
0.8994361
0.8994361
0.8588267
1.338368
These are the values I want to produce in R, but I've realised the issue is that R is not ignoring the offset, as when I do predict lb, xb
in Stata i.e. keeping the offset based on the expected deaths in, I get the same values as I did in R:
-3.812156
-2.472995
-2.499536
-2.299561
-2.217279
The R documentation for glm (see https://www.math.ucla.edu/~anderson/rw1001/library/base/html/glm.html) states that "Offsets specified by offset will not be included in predictions by predict.glm, whereas those specified by an offset term in the formula will be" i.e. if I use the model as I did, the offset should be ignored:
model <- glm(Event ~ relevel( as.factor(Period), ref=2) + relevel( as.factor(AgeCat), ref="50-59") + relevel( as.factor(Sex), ref="Female") + relevel( as.factor(AlcCombo), ref="0") + relevel( as.factor(ScoreSurv), ref="0") + relevel( as.factor(DrugCombo), ref="0"), offset = (log(ExpDeaths)), data=data, family = poisson)
As opposed to using the below which would mean the offset wasn't ignored when using predict
according to the documentation:
model <- glm(Event ~ relevel( as.factor(Period), ref=2) + relevel( as.factor(AgeCat), ref="50-59") + relevel( as.factor(Sex), ref="Female") + relevel( as.factor(AlcCombo), ref="0") + relevel( as.factor(ScoreSurv), ref="0") + relevel( as.factor(DrugCombo), ref="0") + offset(log(ExpDeaths)), data=data, family = poisson)
However, I get the exact same model (which I would expect) and linear predictors (which should be different) using both, which leads me to conclude that neither way of writing the model in R is leading to the offset being ignored when using predict
.
I know I can just use Stata to get the desired results, but I really want to know how to get the Stata results using R just for my own sanity i.e. how to get predict to ignore the offset using R.
When you call nooffset
you are simply subtracting the offset from the linear predictor.
use https://data.princeton.edu/wws509/datasets/ceb.dta,clear
gen y=round(mean*n,1)
gen os=log(n)
poisson y i.res, offset(os)
predict xb, xb
predict lp, xb nooffset
list in 1/6,clean
i dur res educ mean var n y os xb lp
1. 1 0-4 Suva None .5 1.14 8 4 2.079442 3.284039 1.204598
2. 2 0-4 Suva Lower primary 1.14 .73 21 24 3.044523 4.24912 1.204598
3. 3 0-4 Suva Upper primary .9 .67 42 38 3.73767 4.942267 1.204598
4. 4 0-4 Suva Secondary+ .73 .48 51 37 3.931826 5.136423 1.204598
5. 5 0-4 Urban None 1.17 1.06 12 14 2.484907 3.833794 1.348887
6. 6 0-4 Urban Lower primary .85 1.59 27 23 3.295837 4.644724 1.348887
Here, notice that I can replicate the stata call predict lp, xb nooffset
, simply by subtracting os
from xb
(See ceb$lp=ceb$xb-ceb$os
)
library(foreign)
ceb<- read.dta("http://data.princeton.edu/wws509/datasets/ceb.dta")
ceb$y <- round(ceb$mean*ceb$n, 0)
ceb$os <- log(ceb$n)
m1 = glm(y~res, offset=os,data=ceb,family="poisson")
ceb$xb=predict(m1, type="link")
ceb$lp=ceb$xb-ceb$os
head(ceb)
i dur res educ mean var n y os xb lp
1 1 0-4 Suva None 0.50 1.14 8 4 2.079442 3.284039 1.204598
2 2 0-4 Suva Lower primary 1.14 0.73 21 24 3.044522 4.249120 1.204598
3 3 0-4 Suva Upper primary 0.90 0.67 42 38 3.737670 4.942267 1.204598
4 4 0-4 Suva Secondary+ 0.73 0.48 51 37 3.931826 5.136423 1.204598
5 5 0-4 Urban None 1.17 1.06 12 14 2.484907 3.833794 1.348887
6 6 0-4 Urban Lower primary 0.85 1.59 27 23 3.295837 4.644724 1.348887