Search code examples
rggplot2smoothing

How create gaps in smoother for "missing" values (R, ggplot)


If I have a data set like this

set.seed(100)
data <- data.frame("x" = c(1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 5, 5, 5),
                   "y" = rnorm(13),
                   "factor" = c("a","b","c","a","b", "c", "c", "a",
                                "b", "c", "a", "b","c"))

so it looks like this

   x           y factor
1  1 -0.50219235      a
2  1  0.13153117      b
3  1 -0.07891709      c
4  2  0.88678481      a
5  2  0.11697127      b
6  2  0.31863009      c
7  3 -0.58179068      c
8  4  0.71453271      a
9  4 -0.82525943      b
10 4 -0.35986213      c
11 5  0.08988614      a
12 5  0.09627446      b
13 5 -0.20163395      c

I would like to plot this with a separate smoother each factor (a,b,c)

library(ggplot2)
ggplot(data = data, aes(x = x, y = y, col = factor)) + 
  geom_smooth(aes(group = factor))

However since there are no values for "a" and "b" for x = 3, so I would like the smoothers for "a" and "b" to have a break for x = 3. What's the best strategy to accomplish that?


Solution

  • I would create an expansion of the combinations of x and factor and then do a database-like join on the combinations and the data. For example, first I form a new data frame df with the combinations of the unique values of x and factor

    df <- expand.grid(sapply(data[, c("x", "factor")], unique))
    
    > df
       x factor
    1  1      a
    2  2      a
    3  3      a
    4  4      a
    5  5      a
    6  1      b
    7  2      b
    8  3      b
    9  4      b
    10 5      b
    11 1      c
    12 2      c
    13 3      c
    14 4      c
    15 5      c
    

    Then we can simply perform a join operation on the df and your data, requesting that we return all the rows from the left hand side (the x argument, hence df), and include corresponding values for y from the right hand side (data). Where there is no corresponding right hand side (in data, we will get an NA.

    newdf <- merge(df, data, all.x = TRUE)
    
    > newdf
       x factor           y
    1  1      a -0.50219235
    2  1      b  0.13153117
    3  1      c -0.07891709
    4  2      a  0.88678481
    5  2      b  0.11697127
    6  2      c  0.31863009
    7  3      a          NA
    8  3      b          NA
    9  3      c -0.58179068
    10 4      a  0.71453271
    11 4      b -0.82525943
    12 4      c -0.35986213
    13 5      a  0.08988614
    14 5      b  0.09627446
    15 5      c -0.20163395
    

    Now we can fit and predict from a loess model by hand, but this is a little tedious - easier options are available via mgcv:gam()

    loessFun <- function(XX, span = 0.85) {
      fit <- loess(y ~ x, data = XX, na.action = na.exclude, span = span)
      predict(fit)
    }
    

    Now split the data by factor and apply the loessFun() wrapper

    fits <- lapply(split(newdf, newdf$factor), loessFun)
    newdf <- transform(newdf, fitted = unsplit(fits, factor))
    
    > head(newdf)
      x factor           y      fitted
    1 1      a -0.50219235 -0.50219235
    2 1      b  0.13153117  0.13153117
    3 1      c -0.07891709 -0.07891709
    4 2      a  0.88678481  0.88678481
    5 2      b  0.11697127  0.11697127
    6 2      c  0.31863009  0.31863009
    

    We can then plot using the new data frame

    ggplot(newdf, aes(x = x, y = y, col = factor)) + 
      geom_line(aes(group = factor))
    

    which gives:

    enter image description here

    It looks a bit funky because of the very low resolution of the sample data you provided and because this method that I've used predicts at the observed data only, preserving NAs. geom_smooth() is actually predicting over the range of x for each group separately and as such there are no missing xs in the data used to draw the geom layer.

    Unless you can explain within what region of x = 3 we should add a break (an NA), this may well be the best that you can do. Alternatively, we could predict over the region from the models and then set anything 2.5 < x < 3.5 back to being NA. Add a comment if that is what you wanted and I'll expand my answer with an example of doing that if you can indicate how we are to envisage the gaps.