Tags: r, logistic-regression, spline, gam

What is the difference between plotting the fitted gam object directly and plotting its predicted values (same shape but different ranges)?


I am trying to learn gam() in R for a logistic regression using a spline on a predictor. The two plotting methods in my code give the same shape but different ranges of the response on the logit scale; it seems like an intercept is missing in one of them. Both are supposed to be correct, so why do the ranges differ?

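library(ISLR)  # provides the Wage data set
library(gam)

# Logistic GAM: log odds of wage > 250 as a smooth function of age
gam.lr = gam(I(wage > 250) ~ s(age), family = binomial(link = "logit"), data = Wage)

# Grid of ages spanning the observed range (no attach() needed)
agelims = range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])

# Predictions on the link (log odds) scale
pred = predict(gam.lr, newdata = list(age = age.grid), type = "link")

par(mfrow = c(2, 1))
plot(gam.lr)          # method 1: plot the fitted object directly
plot(age.grid, pred)  # method 2: plot the link-scale predictions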

I expected both methods to give exactly the same plot. plot(gam.lr) plots the additive effect of each component, and since there is only one component here, it should show the predicted logit function. The predict method also gives estimates on the link scale. But the two outputs have different ranges: the minimum value in the first plot is about -4, while in the second it is below -7.


Solution

  • The first plot shows only the estimated smooth function s(age). Smooths are subject to identifiability constraints because the basis expansion used to parametrise the smooth contains a function, or combination of functions, that is entirely confounded with the intercept. Without a constraint you can't fit both the smooth and an intercept in the same model: you could subtract some value from the intercept, add it back to the smooth, and get the same fit with different coefficients. Since this works for any value, there are infinitely many equivalent models, which isn't helpful.

    Hence identifiability constraints are applied to the basis expansions, and the most useful one is to require that the smooth sums to zero over the observed values of the covariate. This centres the smooth at 0, with the intercept then representing the overall level of the response on the link scale.

    So the first plot shows the smooth subject to this sum-to-zero constraint, which is why it straddles 0. The intercept in this model is:

    > coef(gam.lr)[1]
    (Intercept) 
        -4.7175 
    

    If you add this to the values in the first plot, you get the values in the second plot, which is the full model applied to the data you supplied: intercept + f(age).

    All of this is on the link scale, here the log odds scale, which is why the values are all negative.
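
The decomposition described above can be checked numerically. The following is a minimal sketch, assuming the model is fitted exactly as in the question and that predict() with type = "terms" returns the centred term contributions together with a "constant" attribute (as predict.lm and predict.glm do). The object names terms.fit, const, link.fit and prob.fit are introduced here for illustration only. The constant is read from that attribute rather than assumed to equal coef(gam.lr)[1], since how the package centres the linear part of s(age) is not confirmed here.

```r
library(ISLR)  # provides the Wage data set
library(gam)   # same package as in the question

gam.lr <- gam(I(wage > 250) ~ s(age),
              family = binomial(link = "logit"), data = Wage)

# Centred contribution of each model term on the link scale;
# with a single smooth this is what plot(gam.lr) should be drawing
terms.fit <- predict(gam.lr, type = "terms")

# Constant that was removed when the terms were centred
# (assumption: stored in the "constant" attribute)
const <- attr(terms.fit, "constant")

# Full linear predictor (log odds) at the observed ages
link.fit <- predict(gam.lr, type = "link")

# constant + centred smooth should reproduce the link-scale predictions
all.equal(unname(const + rowSums(terms.fit)), unname(link.fit))

# Back-transform from log odds to probabilities with the inverse logit
prob.fit <- plogis(link.fit)
all.equal(unname(prob.fit), unname(predict(gam.lr, type = "response")))
```

If both comparisons come back TRUE, the two plots differ only by that constant shift, and applying plogis() to the link-scale values gives the fitted probabilities of wage > 250.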