
How to find the start and the end of sequences automatically in R for rectangles in ggplot


I am trying to plot some data with shaded rectangles in the background.

The dataframe df looks like this:

df <- data.frame(time  = seq(0.1, 2, 0.1), 
                 speed = c(seq(0.5, 5, 0.5), seq(5, 0.5, -0.5)),
                 type  = c("a", "a", "b", "b", "b", "b", "c", "c", "c", "b", "b", "b", "b", "b", "c", "a", "b", "c", "b", "b"))

For the rectangles in the plot I am defining an object called dfRect with the variables xmin and xmax.

dfRect <- data.frame(xmin = c(0.3, 1.0, 1.9), xmax = c(0.7, 1.5, 2.0))

The problem is that I have to find xmin and xmax manually. A rectangle should start (xmin) where a run of consecutive b values in column type begins and end (xmax) where that same run ends. Single bs can be ignored.

Here is the plot, so you get an idea of what I am trying to accomplish:

ggplot() +
  geom_rect(data = dfRect, 
            aes(xmin = xmin, xmax = xmax, ymin = -Inf, ymax = Inf), 
            fill = "yellow", alpha = 0.4) +
  geom_line(data = df, aes(x = time, y = speed, color = type, group = 1), size = 3)

So the question is: how can I automate the process of finding xmin and xmax, so that dfRect is created automatically and I don't have to define it by hand?


Solution

  • Here is an approach using run-length encoding.

    library(ggplot2)
    
    df <- data.frame(time  = seq(0.1, 2, 0.1), 
                     speed = c(seq(0.5, 5, 0.5), seq(5, 0.5, -0.5)),
                     type  = c("a", "a", "b", "b", "b", "b", "c", "c", "c", "b", "b", "b", "b", "b", "c", "a", "b", "c", "b", "b"))
    
    # Convert to run-length encoding: consecutive runs of "b" / not "b"
    rle <- rle(df$type == "b")
    
    # Ignoring the single "b"s
    rle$values[rle$lengths == 1 & rle$values] <- FALSE
    
    # Determine starts and ends
    ends   <- cumsum(rle$lengths)
    starts <- ends - rle$lengths + 1
    
    # Build a data.frame from the rle
    dfrect <- data.frame(
      xmin = df$time[starts],
      # Shift the ends by one, because each line segment extends to the next data point.
      # Cap at the last index so that we never index out of bounds.
      xmax = df$time[pmin(ends + 1, max(ends))],
      fill = rle$values
    )
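
To see what the run-length encoding holds at each step, here is a small sketch on the same type vector. The variable names is_b, runs, ends and starts are only illustrative and are not part of the code above.

```r
type <- c("a", "a", "b", "b", "b", "b", "c", "c", "c", "b",
          "b", "b", "b", "b", "c", "a", "b", "c", "b", "b")

# TRUE wherever the type is "b"
is_b <- type == "b"

# rle() collapses consecutive equal values into (length, value) pairs
runs <- rle(is_b)
runs$lengths  # 2 4 3 5 2 1 1 2
runs$values   # FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE

# The cumulative sum of the lengths is the last index of each run;
# subtracting the run length and adding 1 gives its first index
ends   <- cumsum(runs$lengths)      # 2 6 9 14 16 17 18 20
starts <- ends - runs$lengths + 1   # 1 3 7 10 15 17 18 19
```

The single "b" at index 17 shows up as a TRUE run of length 1, which is why it can be switched off by testing the lengths.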
    

    This plot gives an idea of what the code above does, with every run drawn as its own rectangle:

    ggplot() +
      geom_rect(data = dfrect, 
                aes(xmin = xmin, xmax = xmax, ymin = -Inf, ymax = Inf, fill = fill), 
                alpha = 0.4) +
      geom_line(data = df, aes(x = time, y = speed, color = type, group = 1), size = 3)
    

    To get the plot you want, filter out the rows where fill is FALSE.

    
    ggplot() +
      geom_rect(data = dfrect[dfrect$fill,], 
                aes(xmin = xmin, xmax = xmax, ymin = -Inf, ymax = Inf), 
                alpha = 0.4, fill = "yellow") +
      geom_line(data = df, aes(x = time, y = speed, color = type, group = 1), size = 3)
    

    If you are looking for a stat that can calculate this for you, have a look here. Disclaimer: I wrote that function, which does a similar thing to the code I posted above.
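
If you want to reuse the steps above, they could be wrapped in a small helper. This is only a sketch: the function name find_run_rects() and its arguments x, group, value and min_length are made up for illustration, and it is not the stat mentioned above. It assumes that group contains no missing values.

```r
# Hypothetical helper (name and arguments are illustrative only).
# Assumes `group` has no NA values and `x` has the same length as `group`.
find_run_rects <- function(x, group, value, min_length = 2) {
  runs   <- rle(group == value)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  # Keep only runs of `value` that are at least `min_length` long
  keep   <- runs$values & runs$lengths >= min_length
  data.frame(
    xmin = x[starts[keep]],
    # Extend each rectangle to the next data point, capped at the last index
    xmax = x[pmin(ends[keep] + 1, length(x))]
  )
}

df <- data.frame(time = seq(0.1, 2, 0.1),
                 type = c("a", "a", "b", "b", "b", "b", "c", "c", "c", "b",
                          "b", "b", "b", "b", "c", "a", "b", "c", "b", "b"))

dfRect <- find_run_rects(df$time, df$type, value = "b")
dfRect
```

For the example data this should give the same three rectangles as the manually defined dfRect in the question (xmin 0.3, 1.0, 1.9 and xmax 0.7, 1.5, 2.0, up to floating-point rounding), so the result can be passed straight to geom_rect().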