I'm working on a hurdle model and ran into a question I can't quite figure out. It was my understanding that the overall response prediction of the hurdle model is the product of the count prediction and the probability prediction, i.e., the overall response has to be smaller than or equal to the count prediction. However, in my data, the response prediction is higher than the count prediction, and I can't figure out why.
Here's a similar result for a toy model (code adapted from here):
library("pscl")
data("RecreationDemand", package = "AER")
## model
m <- hurdle(trips ~ quality | ski, data = RecreationDemand, dist = "negbin")
nd <- data.frame(quality = 0:5, ski = "no")
predict(m, newdata = nd, type = "count")
predict(m, newdata = nd, type = "response")
Why is it that the counts are higher than the responses?
Edit: added comparison to glm.nb
Also, I was under the impression that the count part of the hurdle model should give predictions identical to those of a count model fitted only to the positive values. When I try that, I get completely different values. What am I missing?
library(MASS)
m.nb <- glm.nb(trips ~ quality, data = RecreationDemand[RecreationDemand$trips > 0,])
predict(m, newdata = nd, type = "count") ## hurdle
predict(m.nb, newdata = nd, type = "response") ## positive counts only
The last question is the easiest to answer. The "count" part of the hurdle model is not simply a standard count model (which assigns positive probability to zeros) but a zero-truncated count model (in which zeros cannot occur).
Using the countreg package from R-Forge you can fit the model you attempted to fit with glm.nb in your example. (Alternatively, VGAM or gamlss could also be used to fit the same model; a short sketch with VGAM is included after the coefficient comparison below.)
library("countreg")
m.truncnb <- zerotrunc(trips ~ quality, data = RecreationDemand,
subset = trips > 0, dist = "negbin")
cbind(hurdle = coef(m, model = "count"), zerotrunc = coef(m.truncnb), negbin = coef(m.nb))
##                 hurdle  zerotrunc     negbin
## (Intercept) 0.08676189 0.08674119 1.75391028
## quality     0.02482553 0.02483015 0.01671314
Up to small numerical differences the first two models are exactly equivalent. The non-truncated model, however, has to compensate for the lack of zeros by increasing the intercept and dampening the slope parameter, which is clearly not appropriate here.
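For completeness, here is a hedged sketch of the same zero-truncated negative binomial fit via VGAM's posnegbinomial() family; the parameterization and defaults differ somewhat from countreg, so treat it as illustrative rather than a drop-in replacement:
library("VGAM")
m.vgam <- vglm(trips ~ quality, family = posnegbinomial(),
  data = subset(RecreationDemand, trips > 0))
coef(m.vgam, matrix = TRUE)  ## the munb column should be close to the hurdle/zerotrunc coefficients above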
As for the predictions, one can distinguish three quantities:
1. The mean of the count component without truncation, i.e., the expectation of a negative binomial that still assigns positive probability to zero.
2. The mean of the count component with zero truncation, i.e., the expectation given that the response is positive. This is always larger than quantity 1.
3. The overall mean of the hurdle model, i.e., the probability of clearing the hurdle (from the zero component) times the zero-truncated mean from quantity 2.
This also answers your first question: because the zero-truncated mean (2) exceeds the untruncated mean (1), multiplying it by the hurdle probability in (3) can still yield a value above (1). This happens whenever the probability of clearing the hurdle is larger than the probability of a positive count under the untruncated count distribution.
predict(..., type = "count") computes quantity 1 from above, whereas predict(..., type = "response") computes quantity 3 for a hurdle model and quantity 2 for a zerotrunc model. See also Section 2.2 and Appendix C in vignette("countreg", package = "pscl") for more details and formulas.
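To make the decomposition concrete for the toy model above, here is a minimal sketch using only predict() on the fitted hurdle model; the helper names p_pos and mu_trunc are mine, and quantity 2 is recovered by rearranging the factorization in quantity 3 rather than computed from scratch:
mu <- predict(m, newdata = nd, type = "count")             ## quantity 1: untruncated count mean
resp <- predict(m, newdata = nd, type = "response")        ## quantity 3: overall hurdle mean
p_pos <- 1 - predict(m, newdata = nd, type = "prob")[, 1]  ## P(y > 0), governed by the zero hurdle
mu_trunc <- resp / p_pos                                   ## quantity 2: zero-truncated mean
cbind(count = mu, truncated = mu_trunc, response = resp)
## the truncated mean exceeds the untruncated mean, so multiplying it by
## P(y > 0) can still give a response above the "count" prediction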