Search code examples
rggplot2ggproto

Escape 'discrete aesthetic implies group' in custom stat


I'm trying to build a custom stat function with ggplot2 wherein I would like to access a discrete variable to compute a statistic with per group. However, the default behaviour of ggplot layers is to automatically assign implicit groups to any discrete variables (mostly). This means that my data gets split up over an automatic grouping, which I wouldn't want.

I can show this as follows; I have a pretty standard constructor:

library(ggplot2)

stat_example <- function(
  mapping = NULL,
  data = NULL,
  geom = "point",
  position = "identity",
  ...,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE
) {
  layer(data = data,
        mapping = mapping,
        stat = StatExample,
        geom = geom,
        position = position,
        show.legend = show.legend,
        inherit.aes = inherit.aes,
        params = list(na.rm = na.rm))
}

And I have a Stat ggproto object that simply passes along the data, but prints the head of the data for illustration purposes. I've called the bit that I'm interested in for computing an actual stat value here.

StatExample <- ggproto(
  "StatExample",
  Stat,
  required_aes = c("x", "y", "value"),
  default_aes = aes(x = after_stat(x), y = after_stat(y)),
  compute_group = function(data, scales) {
    print(head(data, 2))
    data
  }
)

Now if I construct a plot with this stat, we can see what goes into the compute_group() function as data.

g <- ggplot(iris) +
  stat_example(aes(Sepal.Width, Sepal.Length, value = Species))

# To get only the print side-effect, not the plot (which looks normal)
g <- ggplotGrob(g)
#>     x   y  value PANEL group
#> 1 3.5 5.1 setosa     1     1
#> 2 3.0 4.9 setosa     1     1
#>      x   y      value PANEL group
#> 51 3.2 7.0 versicolor     1     2
#> 52 3.2 6.4 versicolor     1     2
#>       x   y     value PANEL group
#> 101 3.3 6.3 virginica     1     3
#> 102 2.7 5.8 virginica     1     3

Created on 2020-05-28 by the reprex package (v0.3.0)

I would like to have 1 data.frame containing all the data for this case. We see above that we've printed 3 data.frames with different group variables, meaning that the data has been split into 3 groups. What I think it would take to get there, is to have the value variable escape the automatic group detection.

I've considered the following points:

  • I could let the group default to -1, which is the standard 'no group'-group. However, when I do this, the data will not get automatically grouped when for example aes(colour = some_variable). This I definitively want to happen.
  • Looking at ggplot2:::add_group() function, it seems I can escape the autogrouping by naming my value variable label, however this would make the stat incompatible with geom_text() and it doesn't describe the meaning of value naturally.
  • I could replace the layer() call with a variant of this function, that would make a different Layer ggproto object wherein compute_aesthetics() works out groups differently. This however is a lot of work I would rather prevent to be burdened with.
  • I could probably pull a trick along the lines of vctrs::new_vctr(..., class = "not_discrete"), but where is the appropriate place to wrap my value variable in that class?

Helpful suggestions are welcome, or new takes on 'just use label' arguments too.


Solution

  • If this is an occasional use case, a simple (albeit manual) hack could be running trace(ggplot2:::add_group, edit = TRUE) and add "value" alongside "label", "PANEL" as variable names to be excluded from automatic group detection.

    A less manual (but probably more fragile) way to achieve the same effect would involve the following steps:

    1. Define a modified version of the add_group function with the above modification;
    2. Define a modified version of the Layer ggproto object that uses the modified add_group in its compute_aesthetics function;
    3. Point the custom stat to the modified layer.
    # define modified add_group function
    add_group2 <- function (data) {
      if (ggplot2:::empty(data)) 
        return(data)
      if (is.null(data$group)) {
        disc <- vapply(data, ggplot2:::is.discrete, logical(1))
        disc[names(disc) %in% c("label", "PANEL", "value")] <- FALSE         # change here
        if (any(disc)) {
          data$group <- vctrs::vec_group_id(data[disc])
        }
        else {
          data$group <- ggplot2:::NO_GROUP
        }
      } else {
        data$group <- vctrs::vec_group_id(data["group"])
      }
      data
    }
    
    # define modified compute_aesthetics function that uses modified add_group in second last line
    compute_aesthetics_alt <- .subset2(ggplot2:::Layer, "compute_aesthetics")
    body(compute_aesthetics_alt)[[length(body(compute_aesthetics_alt)) - 1]] <- 
      quote(evaled <- add_group2(evaled))
    
    # define modified Layer ggproto object that uses alternative compute_aesthetics
    Layer2 <- ggproto("Layer2",
                      ggplot2:::Layer,
                      compute_aesthetics = compute_aesthetics_alt)
    
    # define modified stat with Layer2 specified as its layer_class
    stat_example <- function(
      mapping = NULL,
      data = NULL,
      geom = "point",
      position = "identity",
      ...,
      na.rm = FALSE,
      show.legend = NA,
      inherit.aes = TRUE
    ) {
      layer(data = data,
            mapping = mapping,
            stat = StatExample,
            geom = geom,
            position = position,
            show.legend = show.legend,
            inherit.aes = inherit.aes,
            params = list(na.rm = na.rm),
            layer_class = Layer2) # change here
    }
    

    Usage:

    # add new column to simulate different colour
    iris$gg <- sample(c("a", "b"), size = nrow(iris), replace = TRUE) 
    
    ggplot(iris) + 
      stat_example(aes(Sepal.Width, Sepal.Length,
                       value = Species))
    # prints one data frame, because there's only one group by default
    
    ggplot(iris) + 
      stat_example(aes(Sepal.Width, Sepal.Length,
                       value = Species, colour = gg))
    # prints two data frames, because grouping is based on the colour aesthetic,
    # which has two possible values