What does "training" the data mean in the internals of ggplot2?

I'm following along with the internals of the ggplot2 library and I'm trying to understand how non-positional aesthetics get mapped to the values that get passed to grid. The book describes this process as

"The last part of the data transformation is to train and map all non-positional aesthetics, i.e. convert whatever discrete or continuous input that is mapped to graphical parameters such as colours, linetypes, sizes etc."

However, this is the first time that the idea of "training" the data appears in the text.

The code for this process (from ggplot2:::ggplot_build.ggplot) appears to be:

  # Train and map non-position scales and guides
  npscales <- scales$non_position_scales()
  if (npscales$n() > 0) {
    lapply(data, npscales$train_df)
    plot$guides <- plot$guides$build(npscales, plot$layers, plot$labels, data)
    data <- lapply(data, npscales$map_df)
  } else {
    # Only keep custom guides if there are no non-position scales
    plot$guides <- plot$guides$get_custom()
  }

but I'm unable to follow along with what's actually happening here. Does the lapply(data, npscales$train_df) actually do anything? It doesn't seem to be saved and I would've expected it to be data <- lapply(data, npscales$train_df) instead, but the function seems to always return NULLs no matter what plot I try it with.

What does "training" non-positional data mean in the ggplot2 package?

Solution

In ggplot2 terms, 'training' means keeping track of the possible values of a variable. For continuous variables this is the range, and for discrete variables it is the set of levels. 'Keeping track' here means going over every layer's data and updating the possible values based on the values encountered in that data.

Under the hood, this is all orchestrated by the {scales} package's DiscreteRange and ContinuousRange classes. See below for examples of how these are updated.

# At first, tracked variable is empty
range <- scales::DiscreteRange$new()
range$range
#> NULL

# Observe data in first layer
range$train(c("A", "X"))
range$range
#> [1] "A" "X"

# Observe data in second layer
range$train(c("B"))
range$range
#> [1] "A" "B" "X"

The same goes for continuous ranges:

# Again empty at first
range <- scales::ContinuousRange$new()
range$range
#> NULL

# Observe data in first layer
range$train(c(0, 10))
range$range
#> [1]  0 10

# Observe data in second layer
range$train(c(100))
range$range
#> [1]   0 100

Created on 2024-07-18 with reprex v2.1.1

In the code you present, lapply(data, npscales$train_df) is doing this job. The train_df method is called purely for its side effect of updating the scales' ranges. It returns NULL because it does not alter the data itself, so there is no result worth assigning.
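To make the "train for the side effect, then map" pattern concrete, here is a minimal sketch that mimics it with {scales} only. This is not ggplot2's actual implementation: the `layers` list is made-up data, the anonymous function only imitates a train_df that returns NULL, and scales::rescale() stands in for whatever mapping a real scale would apply.

```r
library(scales)

# Two hypothetical layers that share one non-positional aesthetic
layers <- list(c(0, 10), c(100))

rng <- ContinuousRange$new()

# Step 1: train. Called only for its side effect on `rng`;
# the returned list holds nothing useful, just like with train_df.
trained <- lapply(layers, function(x) {
  rng$train(x)
  invisible(NULL) # imitate train_df, which returns NULL
})

# The range now covers the values of all layers
rng$range

# Step 2: map. Here the result is kept, because the data is transformed.
# Every layer is mapped against the same fully trained range.
mapped <- lapply(layers, function(x) {
  rescale(x, to = c(0, 1), from = rng$range)
})
mapped
```

Because the range has to be complete before any layer can be mapped consistently, training runs over all layers first and mapping only happens afterwards, which is the same order as in the ggplot_build code you quoted.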

The 'non-positional' part means that the x and y aesthetics (and related ones such as xmin and yend) don't participate here, as they need special treatment and are trained much earlier in the plot building process.