Search code examples
rggplot2cluster-analysismgcv

Combine gam smooths of multiple clusters in an mgcViz plot


I have gam smooths for several clusters coming from a model-based clustering code combining unequal length time series and I would like to display them along with the data.

The mgcViz package provides excellent visualizations for individual clusters but I don't see how to combine them. Perhaps it's because it aims at visualizing several effects rather than several clusters. Nevertheless, its capability is very close to what I need, so here is a reproducible example (adapted from https://mfasiolo.github.io/mgcViz/articles/mgcviz.html):

library(mgcViz)
n = 1e3
z = rnorm(n)
dat = data.frame(x = rep(z, times = 2),
                 y = rep(c(1,2), each = n) + c(sin(z), 0.5*z^2) + rnorm(2*n)/4,
                 g = factor(rep(1:2, each = n)))

b <- lapply(1:2, function(i, dat) gam(y ~ s(x), data = dat[dat$g == i, ]),
            dat = dat)

plot(getViz(b[[1]])) + l_points() + l_fitLine() + l_ciLine()   # First
plot(getViz(b[[2]])) + l_points() + l_fitLine() + l_ciLine()   # Second
plot(getViz(b))   # Third
ggplot(dat, aes(x, y, color = g)) + geom_point(pch = ".") + theme_bw() # Fourth

First plotSecond plot Third plotFourth plot

I would like to combine the first two plots into one as is partially done in the third plot. Putting the third plot into the data shown in the fourth would do the trick. This also needs different intercept shifts in the third plot fits. Adding l_points() to the third plot renders it empty.

A hidden constraint is that the gam smooths are separate list components (as they are above) because they are actually coming from a custom clustering code for time series snippets of unequal length and spacing using mgcv's bam for very large data. So the plotting should preferably get all its information from b, a list gam results for each cluster.


Solution

  • Not mgcViz, but you can simply create the needed output yourself and use ggplot2:

    library(mgcv)
    library(ggplot2)
    theme_set(theme_bw())
    
    n = 1e3
    z = rnorm(n)
    dat = data.frame(
      z = rep(z, times = 2),
      y = rep(c(1,2), each = n) + c(sin(z), 0.5*z^2) + rnorm(2*n)/4,
      g = factor(rep(1:2, each = n)))
    
    b <- gam(y ~ g + s(z, by = g), data = dat)
    
    ndf <- expand.grid(z = seq(min(dat$z), max(dat$z), length.out=100), g = unique(dat$g))
    ndf$pred <- predict(b, newdata = ndf, type = "response")
    
    ggplot(ndf, aes(x = z, y = pred, col = g)) +
      geom_line() +
      geom_point(data = dat, aes(y = y))
    

    Created on 2021-01-28 by the reprex package (v0.3.0)

    Edit

    If you want to fit separate models per group, could do something like this:

    library(purrr)
    ndf <- map_dfr(
      .x = unique(dat$g),
      .f = ~{
        mod_i <- gam(y ~ s(x), data = dat[dat$g == .x, ])
        ndf_i <- expand.grid(x = seq(min(dat$x), max(dat$x), length.out = 100))
        ndf_i$g <- .x
        ndf_i$pred <- predict(mod_i, newdata = ndf_i, type = "response")
        ndf_i
      })
    
    ggplot(ndf, aes(x = x, y = pred, col = g)) +
      geom_line() +
      geom_point(data = dat, aes(y = y))