Search code examples
rggplot2facet-wrapgam

ggplot fails to draw smooth gam using facet_wrap and group asthetic


I am trying to plot a multi-panel and multi-line plot using ggplot with a group aesthetic as well as facet_wrap. However, the geom_smooth fails for all lines in the facet plot when one group has too few data points.

plot1 <- ggplot(data=df1, 
                aes(x=Year, y=Mean, group=Group2, linetype=Group2, shape=Group2)) +  
  geom_errorbar(aes(ymin=Mean-SE, ymax=Mean+SE), width=0.2) +  
  geom_smooth(method = "gam", se=F, formula = y ~ s(x, k=3), size = 1, colour="black") + 
  geom_point(position=pd, size=2, fill="white") +  
  scale_x_continuous(limits=c(min(df1$Year-0.1), max(df1$Year+0.1)), 
                     breaks=seq(min(df1$Year),max(df1$Year),5)) +  
  facet_wrap(~Group1, scales = "free", ncol=2) +  
  theme_bw() + 
  theme(axis.text.x = element_text(),
        axis.title.x = element_blank(), 
        strip.background = element_blank(), 
        axis.line.x = element_line(colour="black"),
        axis.line.y = element_line(colour="black"), 
        panel.grid.minor = element_blank(), 
        panel.grid.major = element_blank(),
        panel.border = element_blank(), 
        panel.background = element_blank(),
        legend.position="top",
        legend.title = element_blank())
plot(plot1)

to produce plot the following plot. This is only summary data to make it easier. It's as if the error stops ggplot from computing the series smooths in that particular facet.

data

Year    Group1      Group2      Mean        SE
2011    Factor A    Factor C    30.62089116 3.672624771
2011    Factor A    Factor D    54.99066324 2.822405771
2011    Factor B    Factor C    30.48859003 3.748388489
2011    Factor B    Factor D    45.70410611 4.284244405
2017    Factor A    Factor C    33.68256601 4.030964172
2017    Factor A    Factor D    53.43496462 4.687042033
2017    Factor B    Factor C    23.08799875 5.17753488
2001    Factor A    Factor C    23.79166667 2.837795432
2001    Factor A    Factor D    23.75925926 3.688185081
2001    Factor B    Factor C    29.05555556 4.08597798
2001    Factor B    Factor D    28.13333333 7.877429079
2008    Factor A    Factor C    23.3        2.383624691
2008    Factor A    Factor D    28.83333333 2.750959429
2008    Factor B    Factor C    34.01666667 5.340999698

and plot

plot 1

Clearly there is enough data to draw a smooth for factorC line in group factorB. Any idea?


Solution

  • I think this is rather tricky. After some testing & reading through the current GH code for StatSmooth, I've summarised my findings as follows:

    Observations

    1. geom_smooth() fails to draw all smoothed lines in a plot panel, if any of the data groups has too few observations for method = "gam" AND formula = y ~ s(x, k = 3);
    2. If the plot is faceted into multiple panels, only the panel(s) with offending data groups are affected;
    3. This does not happen for formula = y ~ x (i.e. default formula);
    4. This does not happen for some other methods (e.g. "lm", "glm") with the default formula, but does occur with method = "loess";
    5. This does not happen if the data group has only 1 observation.

    We can reproduce the above, with some simplified code:

    # create sample data
    n <- 30
    set.seed(567)
    df.1 <- data.frame( # there is only 1 observation for group == B
      x = rnorm(n), y = rnorm(n),
      group = c(rep("A", n - 1), rep("B", 1)),
      facet = sample(c("X", "Y"), size = n, replace = TRUE))    
    set.seed(567)
    df.2 <- data.frame( # there are 2 observations for group == B
      x = rnorm(n), y = rnorm(n),
      group = c(rep("A", n - 2), rep("B", 2)),
      facet = sample(c("X", "Y"), size = n, replace = TRUE))
    
    # create base plot
    p <- ggplot(df.2, aes(x = x, y = y, color = group)) + 
      geom_point() + theme_bw()
    
    # problem: no smoothed line at all in the entire plot
    p + geom_smooth(method = "gam", formula = y ~ s(x, k = 3))
    
    # problem: no smoothed line in the affected panel
    p + facet_wrap(~ facet) + 
      geom_smooth(method = "gam", formula = y ~ s(x, k = 3))
    
    # no problem with default formula: smoothed lines in both facet panels
    p + facet_wrap(~ facet) + geom_smooth(method = "gam")
    
    # no problem with lm / glm, but problem with loess
    p + facet_wrap(~ facet) + geom_smooth(method = "lm")
    p + facet_wrap(~ facet) + geom_smooth(method = "glm")
    p + facet_wrap(~ facet) + geom_smooth(method = "loess")
    
    # no problem if there's only one observation (instead of two)
    p %+% df.1 + geom_smooth(method = "gam", formula = y ~ s(x, k = 3))
    p %+% df.1 + facet_wrap(~ facet) + 
      geom_smooth(method = "gam", formula = y ~ s(x, k = 3))
    

    Explanation for observations 1 & 2:

    I believe the problem lies in the last two lines within StatSmooth's compute_group function. The first line calls the model function (e.g. stats::glm, stats::loess, mgcv::gam) on data frame for each group specified by the aes(group = ...) mapping, while the second line calls one of the wrappers around stats::predict() to get the smoothed values (and confidence interval, if applicable) for the model.

    model <- do.call(method, c(base.args, method.args))
    predictdf(model, xseq, se, level)
    

    When the parameters method = "gam", formula = y ~ s(x, k = 3) are used for the dataframe with only 2 observations, this is what happens:

    model <- do.call(mgcv::gam,
                     args = list(formula = y ~ s(x, k = 3),
                                 data = df.2 %>% filter(group == "B" & facet == "X")))
    

    Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : A term has fewer unique covariate combinations than specified maximum degrees of freedom

    model, the object defined to take on the result of do.call(...), hasn't even been created. The last line of code predictdf(...) will throw an error because model doesn't exist. Without faceting, this affects all the computations done by StatSmooth, and geom_smooth() receives no usable data to create any geom in its layer. With faceting, the above computations are done separately for each facet, so only the facet(s) with problematic data are affected.

    Explanation for observations 3 & 4:

    Adding on to the above, if we don't specify a formula to replace the default y ~ x, we will get a valid model object from gam / lm / glm, which can be passed to ggplot2's unexported predictdf function for a dataframe of prediction values:

    model <- do.call(mgcv::gam, # or stats::lm, stats::glm
                     args = list(formula = y ~ x,
                                 data = df.2 %>% filter(group == "B" & facet == "X")))
    
    result <- ggplot2:::predictdf(
      model, 
      xseq = seq(-2, 1.5, length.out = 80), # pseudo range of x-axis values
      se = FALSE, level = 0.95) # default SE / level parameters
    

    loess will return a valid object as well, albeit with loads of warnings. However, passing it to predictdf will result in an error:

    model <- do.call(stats::loess,
                     args = list(formula = y ~ x,
                                 data = df.2 %>% filter(group == "B" & facet == "X")))
    
    result <- ggplot2:::predictdf(
      model, 
      xseq = seq(-2, 1.5, length.out = 80), # pseudo range of x-axis values
      se = FALSE, level = 0.95) # default SE / level parameters
    

    Error in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), : NA/NaN/Inf in foreign function call (arg 5)

    Explanation for observation 5:

    StatSmooth's compute_group function starts with the following:

    if (length(unique(data$x)) < 2) {
          # Not enough data to perform fit
          return(data.frame())
        }
    

    In other words, if there is only 1 observation in a specified group, StatSmooth immediately returns a blank data frame. Hence, it will never reach the subsequent parts of the code to throw any error.

    Workaround:

    Having pinpointed where things were going off the tracks, we can make tweaks to the compute_group code (see annotated & commented out portions):

    new.compute_group <- function(
      data, scales, method = "auto", formula = y~x, se = TRUE, n = 80, span = 0.75, 
      fullrange = FALSE, xseq = NULL, level = 0.95, method.args = list(), na.rm = FALSE) {
      if (length(unique(data$x)) < 2) return(data.frame()) 
      if (is.null(data$weight)) data$weight <- 1
      if (is.null(xseq)) {
        if (is.integer(data$x)) {
          if (fullrange) {
            xseq <- scales$x$dimension()
          } else {
            xseq <- sort(unique(data$x))
          }
        } else {
          if (fullrange) {
            range <- scales$x$dimension()
          } else {
            range <- range(data$x, na.rm = TRUE)
          }
          xseq <- seq(range[1], range[2], length.out = n)
        }
      }
      if (identical(method, "loess")) method.args$span <- span 
      if (is.character(method)) method <- match.fun(method)
      base.args <- list(quote(formula), data = quote(data), weights = quote(weight))
    
      # if modelling fails, return empty data frame
      # model <- do.call(method, c(base.args, method.args))
      model <- try(do.call(method, c(base.args, method.args)))
      if(inherits(model, "try-error")) return(data.frame())
    
      # if modelling didn't fail, but prediction returns NA,
      # also return empty data frame
      # predictdf(model, xseq, se, level)
      pred <- try(ggplot2:::predictdf(model, xseq, se, level))
      if(inherits(pred, "try-error")) return(data.frame())
      return(pred)
    }
    

    Define a new stat layer that uses this version:

    # same as stat_smooth() except that it uses stat = StatSmooth2, rather 
    # than StatSmooth
    stat_smooth_local <- function(
      mapping = NULL, data = NULL, geom = "smooth", position = "identity", ...,
      method = "auto", formula = y ~ x, se = TRUE, n = 80, span = 0.75,
      fullrange = FALSE, level = 0.95, method.args = list(), na.rm = FALSE,
      show.legend = NA, inherit.aes = TRUE) {
      layer(
        data = data, mapping = mapping, stat = StatSmooth2,
        geom = geom, position = position, show.legend = show.legend,
        inherit.aes = inherit.aes,
        params = list(
          method = method, formula = formula, se = se, n = n,
          fullrange = fullrange, level = level, na.rm = na.rm,
          method.args = method.args, span = span, ...
        )
      )
    }
    
    # inherit from StatSmooth
    StatSmooth2 <- ggproto(
      "StatSmooth2", ggplot2::StatSmooth,
      compute_group = new.compute_group
    )
    

    Result:

    We can run through the same cases as before, replacing geom_smooth() with stat_smooth_local(), & verify that the smoothed geom layers are visible in every case (note that some will still result in error messages):

    # problem resolved: smoothed line for applicable group in the entire plot
    p + stat_smooth_local(method = "gam", formula = y ~ s(x, k = 3))
    
    # problem resolved: smoothed line for applicable group in the affected panel
    p + facet_wrap(~ facet) + 
      stat_smooth_local(method = "gam", formula = y ~ s(x, k = 3))
    
    # still no problem with default formula
    p + facet_wrap(~ facet) + stat_smooth_local(method = "gam")
    
    # still no problem with lm / glm; problem resolved for loess
    p + facet_wrap(~ facet) + stat_smooth_local(method = "lm")
    p + facet_wrap(~ facet) + stat_smooth_local(method = "glm")
    p + facet_grid(~ facet) + stat_smooth_local(method = "loess")
    
    # still no problem if there's only one observation (instead of two)
    p %+% df.1 + stat_smooth_local(method = "gam", formula = y ~ s(x, k = 3))
    p %+% df.1 + facet_wrap(~ facet) + 
      stat_smooth_local(method = "gam", formula = y ~ s(x, k = 3))
    
    # showing one pair of contrasts here
    cowplot::plot_grid(
      p + facet_wrap(~ facet) + ggtitle("Before") +
        geom_smooth(method = "gam", formula = y ~ s(x, k = 3)),
      p + facet_wrap(~ facet) + ggtitle("After") +
        stat_smooth_local(method = "gam", formula = y ~ s(x, k = 3)),
      nrow = 2
    )
    

    plot