Search code examples
rggplot2smoothinggam

How to plot the smooths from multiple GAMs on the same plot


I am quite new to all this so not sure if this is a simple question!

I am trying to plot the smooths from two GAM models on the same axes. I can see that there have already been various questions similar to this in the past, but I can't find a specific answer to my question.

I have data (dset) which looks has about 600 study participants (studynr), each with somewhere between 1 and 15 study visits (one row of dset for each visit). At each visit a complete list of their prescribed medications was captured. I have categorised these medications into two categories and counted them (TOTAL_P and TOTAL_S) and am interested in the trajectories in the total number of these medications over time. Time is negative as it is in years before a certain event.

I have tried to produce a small amount of dummy data here, but the full dataset is much larger, with differing numbers of visits per participant and the time is not in integers values.

dset <- data.frame(studynr=rep(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), each =10),
                      time = rep(c(-9,-8,-7,-6,-5,-4,-3,-2,1,0), times=20), 
                    TOTAL_P = sample(1:10, 200, replace=T),
                    TOTAL_S = sample(1:10, 200, replace=T))

I then have two GAM models (using bam() as my dset is quite large)

bam_s <- bam(TOTAL_S ~ s(time)
           + s(studynr, bs = 're') 
           + s(studynr, time, bs = 're'),
           data = dset, method = "REML")

bam_p <- bam(TOTAL_P ~ s(time)
             + s(studynr, bs = 're')
             + s(studynr, time, bs = 're'),
             data = dset, method = "REML")

What I want to do is plot the smooth from each model on one combined plot, with the number of medications on the y-axis and time on the x-axis.

I have tried a number of different options using {gratia}'s compare_smooth() and plot.gam() without success. Ultimately I want to try to get the plot in a ggplot2 format, the main issue I am facing is getting the y-axis scale correct.

Thanks.


Solution

  • The brute force way is to simply predict from both models, excluding both random effect terms and then plot:

    library("gratia")
    library("mgcv")
    library("dplyr")
    library("ggplot2")
    
    N <- 100
    ds <- data_slice(dset, time = evenly(time, n = N)) |>
      select(-TOTAL_P, -TOTAL_S)
    
    fv_s <- fitted_values(bam_s, data = ds,
                          exclude = c("s(studynr)", "s(studynr,time)"))
    fv_p <- fitted_values(bam_p, data = ds,
                          exclude = c("s(studynr)", "s(studynr,time)"))
    
    fv <- fv_s |>
      bind_rows(fv_p) |>
      mutate(category = rep(c("S", "P"), each = N))
    
    fv |>
      ggplot(aes(x = time, y = fitted, group = category)) +
      geom_ribbon(aes(ymin = lower, ymax = upper, fill = category),
                  alpha = 0.2) +
      geom_line(aes(colour = category)) +
      labs(y = "Count")
    

    enter image description here

    But it would be so much better to actually fit a model that would allow a statistical comparison:

    library("tidyr")
    
    df <- dset |>
      tidyr::pivot_longer(cols = c(-studynr, -time),
        values_to = "count", names_to = "category", names_prefix = "TOTAL_",
        names_ptypes = list(category = factor()))
    
    m <- bam(count ~ category + s(time, by = category) +
               s(studynr, bs = 're') + s(studynr, time, bs = 're'),
             data = df, family = poisson(), method = "fREML")
    

    Now we have:

    > summary(m)                                                                  
    
    Family: poisson 
    Link function: log 
    
    Formula:
    count ~ category + s(time, by = category) + s(studynr, bs = "re") + 
        s(studynr, time, bs = "re")
    
    Parametric coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  1.73291    0.02975  58.254   <2e-16 ***
    categoryS    0.01352    0.04192   0.323    0.747    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Approximate significance of smooth terms:
                            edf Ref.df Chi.sq p-value
    s(time):categoryP 3.262e+00  4.034  7.609   0.114
    s(time):categoryS 2.961e+00  3.662  3.912   0.358
    s(studynr)        8.032e-06  1.000  0.000   0.493
    s(studynr,time)   8.264e-06  1.000  0.000   0.673
    
    R-sq.(adj) =  0.0123   Deviance explained = 2.63%
    fREML = 654.45  Scale est. = 1         n = 400
    
    draw(m)
    

    enter image description here

    and we can estimate the difference between the two estimated smooths (and plot it) with:

    dif <- difference_smooths(m, smooth = "s(time)")
    dif |> draw()
    

    enter image description here

    and now we can again predict from the model and plot as per the first example as you want these on the count scale:

    ds <- data_slice(m, time = evenly(time, n = N), category = evenly(category))
    fv <- fitted_values(m, data = ds,
                        exclude = c("s(studynr)", "s(studynr,time)"))
    fv |>
      ggplot(aes(x = time, y = fitted, group = category)) +
      geom_ribbon(aes(ymin = lower, ymax = upper, fill = category),
                  alpha = 0.2) +
      geom_line(aes(colour = category)) +
      labs(y = "Count")
    

    enter image description here

    Some notes:

    1. If this is a count response (as you describe) a Gaussian GAM is unlikely to be appropriate; in the second example I fitted a Poisson GAM, which would be the starting point for modelling counts.
    2. When fitting with bam() you need method = "fREML", otherwise you miss out on a lot of the speed/memory advantages of fitting with bam().
    3. The factor-by smooth in the second model is just one way you could fit this kind of thing.