ggplot stat_summary_bin glitch?


I was happy to discover that ggplot2 has binned scatter plots, which are useful for exploring and visualizing relationships in large data sets. Yet the top bin appears to misbehave. Here's an example: all bin averages are roughly linearly aligned, as they should be, but the top one is off on both dimensions:

The code:

library(ggplot2)

# simulate an example of linear data 
set.seed(1)
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt <- data.frame(x=x, y=y)

ggplot(dt, aes(x, y)) + 
  geom_point(alpha = 0.1, size = 0.01) +
  stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point')

Is there a simple workaround (and where should this be posted)?


Solution

  • stat_summary_bin is actually excluding the two rows with the largest x-values from the regular bins, and those two rows are ending up with bin = NA. The mean of those two excluded values is plotted as a separate bin to the right of the regular bins. First, I show what is going wrong in your original plot; then I provide a workaround to get the desired behavior.

    What's going wrong in the original plot

    To see what's going wrong in your original plot, create a plot with two calls to stat_summary_bin: one that calculates the mean of each bin and one that counts the number of values in each bin. Then use ggplot_build to capture all of the internal data that ggplot generated to create the plot.

    p1 = ggplot(dt, aes(x, y)) + 
      geom_point(alpha = 0.1, size = 0.01) +
      stat_summary_bin(fun.y=mean, bins=10, size=5, geom='text',
                       aes(label=..y..)) +
      stat_summary_bin(fun.y=length, bins=10, size=5, geom='text',
                       aes(label=..y.., y=0)) 
    
    p1b = ggplot_build(p1)
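
    As an aside, the per-layer data can also be pulled out one layer at a time with layer_data(), which ggplot2 exports as a shortcut for indexing into the ggplot_build() result. The sketch below is only an illustration, not part of the original diagnosis. It assumes ggplot2 3.3.0 or later, where the summary function argument is spelled fun (older versions use fun.y, as in the code above), and the name p is made up for the example. Whether an extra right-most bin appears may depend on the installed ggplot2 version.

    ```r
    library(ggplot2)

    # Same simulated data as in the question
    set.seed(1)
    N <- 10^4
    x <- runif(N)
    y <- x + rnorm(N)
    dt <- data.frame(x = x, y = y)

    # `fun` assumes ggplot2 >= 3.3.0; older versions spell it `fun.y`
    p <- ggplot(dt, aes(x, y)) +
      geom_point(alpha = 0.1, size = 0.01) +
      stat_summary_bin(fun = mean, bins = 10, geom = "point")

    # Layer 1 is geom_point, layer 2 is the binned summary
    means <- layer_data(p, 2)

    # One row per bin: bin position (x) and bin mean (y)
    means[, c("x", "y")]
    ```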
    

    Now let's look at the data for the mean and length layers, respectively. I've printed only bins 9 through 11 (the three right-most bins) for brevity. Bin 11 is the "extra" bin and you can see that it contains only 2 values (its label is 2 in the second table below), and that the mean of those two values is -0.1309998, as can be seen in the first table below.

    p1b$data[[2]][9:11,c(1,2,4,6,7)]
    
            label bin          y         x      width
    9   0.8158320   9  0.8158320 0.8498505 0.09998242
    10  0.9235531  10  0.9235531 0.9498329 0.09998242
    11 -0.1309998  11 -0.1309998 1.0498154 0.09998244
    
    p1b$data[[3]][9:11,c(1,2,4,6,7)]
    
       label bin    y         x      width
    9   1025   9 1025 0.8498505 0.09998242
    10  1042  10 1042 0.9498329 0.09998242
    11     2  11    2 1.0498154 0.09998244
    

    Which two values are those? It looks like they come from the two rows with the highest x values in the original data frame:

    mean(dt[order(-dt$x), "y"][1:2]) 
    
    [1] -0.1309998
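
    To look at those two rows directly rather than only at their mean, a base R sketch along the following lines could be used. The name top2 is made up for the example, and the data are re-simulated so the block runs on its own.

    ```r
    # Same simulated data as in the question
    set.seed(1)
    N <- 10^4
    x <- runif(N)
    y <- x + rnorm(N)
    dt <- data.frame(x = x, y = y)

    # The two rows with the largest x values
    top2 <- dt[order(-dt$x), ][1:2, ]
    top2

    # Their mean y: the same computation as mean(dt[order(-dt$x), "y"][1:2])
    mean(top2$y)

    # Distance of each of those x values from the upper end of the data range
    max(dt$x) - top2$x
    ```

    Comparing top2$x with the x and width columns in the tables above shows where those rows sit relative to the right edge of bin 10.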
    

    I'm not sure how stat_summary_bin is managing to bin the data such that the two highest x values are excluded.

    Workaround to get the desired behavior

    A workaround is to summarize the data yourself, so you'll have complete control over how the bins are created. The example below uses your original code and then plots pre-summarized values in blue, so you can compare the behavior. I've loaded the dplyr package so that I can use the pipe operator (%>%) to summarize the data on the fly:

    library(dplyr)
    
    ggplot(dt, aes(x, y)) + 
      geom_point(alpha = 0.1, size = 0.01) +
      stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point') +
      geom_point(data=dt %>% 
                   group_by(bins=cut(x,breaks=seq(min(x),max(x),length.out=11), include.lowest=TRUE)) %>%
                   summarise(x=mean(x), y=mean(y)),
                 aes(x,y), size=3, color="blue") +
      theme_bw()
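
    One detail of the workaround deserves a note: the include.lowest=TRUE argument to cut(). The sketch below illustrates how base R's cut() treats values that sit exactly on the outermost break. It says nothing about how stat_summary_bin computes its bins internally, and the names breaks, default_bins and closed_bins are made up for the example.

    ```r
    # Same x values as in the question
    set.seed(1)
    x <- runif(10^4)

    # 11 equally spaced breaks from min(x) to max(x), i.e. 10 bins
    breaks <- seq(min(x), max(x), length.out = 11)

    # Default intervals are (a, b], so a value equal to the first break
    # (the minimum of x) falls outside every interval and becomes NA
    default_bins <- cut(x, breaks = breaks)
    sum(is.na(default_bins))

    # include.lowest = TRUE closes the first interval to [a, b],
    # so every value is assigned to one of the 10 bins
    closed_bins <- cut(x, breaks = breaks, include.lowest = TRUE)
    sum(is.na(closed_bins))
    ```

    Without include.lowest=TRUE, the hand-made bins would drop the row with the smallest x, which is the same kind of edge effect the workaround is meant to avoid.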
    
