I am using a generalized additive model, built with the bam() function from the mgcv R package, to predict probabilities for a binary response.
I seem to get slightly different predictions for the same input data, depending on the make-up of the newdata table provided, and I don't understand why.
The model was built using a formula like this:
model <- bam(response ~ categorical_predictor1 + s(continuous_predictor, bs = "tp"),
             data = data,
             family = "binomial",
             select = TRUE,
             discrete = TRUE,
             nthreads = 16)
I have several more categorical and continuous predictors, but to save space I only mention two in the above formula.
I then predict like this:
predictions <- predict(model,
                       newdata = newdata,
                       type = "response")
I want to make predictions for about 2.5 million rows, but during my testing I predicted for a subset of 250,000.
Each time I use the model to predict for that subset on its own (i.e. newdata = subset) I get the same outputs - this is reproducible. However, if I predict for that same subset as part of the full table of 2.5 million rows (i.e. newdata = full_data), then I get slightly different predictions for those 250,000 rows than when I predict them separately.
I always thought that each row is predicted in turn, based only on the predictors provided, so I can't understand why the predictions change with the context of the newdata. This does not happen when I predict with a standard glm or a random forest, so I assume it's something specific to GAMs or the mgcv package.
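To make the comparison concrete, this is roughly what I'm doing (a simplified sketch; subset_rows is a placeholder index marking the 250,000 rows inside full_data):

# predict the 250,000 rows on their own
subset_data <- full_data[subset_rows, ]
pred_alone  <- predict(model, newdata = subset_data, type = "response")

# predict all 2.5 million rows, then pull out the same 250,000
pred_full    <- predict(model, newdata = full_data, type = "response")
pred_in_full <- pred_full[subset_rows]

# the two versions of the same rows differ slightly
summary(abs(pred_alone - pred_in_full))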
Sorry, I haven't been able to provide a reproducible example - my datasets are large, and I'm not sure if the same thing would happen with a small example dataset.
From the predict.bam help:
"When discrete=TRUE the prediction data in newdata is discretized in the same way as is done when using discrete fitting methods with bam. However the discretization grids are not currently identical to those used during fitting. Instead, discretization is done afresh for the prediction data. This means that if you are predicting for a relatively small set of prediction data, or on a regular grid, then the results may in fact be identical to those obtained without discretization. The disadvantage to this approach is that if you make predictions with a large data frame, and then split it into smaller data frames to make the predictions again, the results may differ slightly, because of slightly different discretization errors."
You likely can't switch to gam() or to discrete=FALSE for the full prediction because you need the speed, so in exchange you'll have to accept some small discretization differences. From the help, it sounds like you may be able to minimize them by choosing your prediction subsets carefully (small sets or regular grids), but you won't be able to eliminate them completely.
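If you want to confirm that the differences really are just discretization error, predict.bam() lets you turn discretization off at prediction time via its discrete argument. A sketch, reusing the subset_data placeholder from the question (exact prediction will be noticeably slower, so try it on a manageable slice first):

# discretized (default) vs exact predictions for the same rows
pred_discrete <- predict(model, newdata = subset_data, type = "response")
pred_exact    <- predict(model, newdata = subset_data, type = "response",
                         discrete = FALSE)
max(abs(pred_discrete - pred_exact))  # should be very small

If you need the subset's predictions to agree exactly with those from the full table, the simplest option is to predict the full 2.5 million rows once and index the 250,000 rows out of that single result, so every prediction comes from the same discretization grid.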