r ggplot2 boxplot

Adding outliers to a boxplot from precomputed summary statistics


I am working with a large (50 x 800 000) sparse matrix (dgCMatrix) and want to draw a boxplot for an initial inspection of the data. The matrix is numeric, with named rows (genes) and named columns (cells). The best solution I have found so far is to compute the relevant statistics via sparseMatrixStats::rowQuantiles() and feed them directly to a boxplot geom.
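A minimal sketch of that step, run on a small simulated matrix rather than the real data. The matrix `m`, its dimensions and the gene names are made up for illustration; the column names of the resulting data frame follow the toy example further down, and the whiskers are simply set to the minimum and maximum.

```r
library(Matrix)
library(sparseMatrixStats)

set.seed(1)

# Hypothetical stand-in for the real data: 5 genes x 200 cells, about 10% non-zero
m <- rsparsematrix(5, 200, density = 0.1,
                   rand.x = function(n) rpois(n, lambda = 3) + 1)
rownames(m) <- paste0("gene", 1:5)

# One row per gene, one column per requested quantile (implicit zeros are included)
q <- rowQuantiles(m, probs = c(0, 0.25, 0.5, 0.75, 1))

# Columns are picked by position so the code does not depend on how they are named
df <- data.frame(gene        = factor(rownames(m)),
                 zero        = q[, 1],
                 twentyfive  = q[, 2],
                 fifty       = q[, 3],
                 seventyfive = q[, 4],
                 hundred     = q[, 5])
```

The dense matrix is never materialised here, which is the point of going through summary statistics for an 800 000-column input.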

I am aware of the approach for geom_boxplot() with precomputed values (this works seamlessly!), see links below, but I run into problems when trying to add outliers via an additional geom.

https://stackoverflow.com/questions/10628847/geom-boxplot-with-precomputed-values
https://stackoverflow.com/questions/65426913/how-to-make-a-boxplot-from-summary-statistics-in-ggplot2
https://stackoverflow.com/questions/68341850/group-specified-geom-boxplot-from-summary-statistics-fails-to-generate-boxplots

In summary, I compute a data frame with relevant quantiles/summary statistics and feed them into geom_boxplot(). I also create a (still rather large) data frame with outliers, which I want to add onto the boxplot via geom_point() or geom_jitter() (as far as I am aware geom_boxplot() does not have a slot to add these in the precomputed approach). The problem arises when trying to add the outliers to the boxplot:

Error in `geom_point()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 2nd layer.
Caused by error in `check_aesthetics()`:
! Aesthetics must be either length 1 or the same as the data (3)
✖ Fix the following mappings: `y`
Run `rlang::last_trace()` to see where the error occurred.

This is a toy example of my problem:

library(ggplot2)

set.seed(1)  # make the sampled outliers reproducible
genes <- factor(c('a', 'b', 'c'))

# df with quantiles (simplified for brevity, with non-standard lower and upper hinges)
df <- data.frame(gene=genes, 
                 zero=c(1, 3, 0),
                 twentyfive=c(2, 4, 8),
                 fifty=c(5, 5, 12),
                 seventyfive=c(7, 9, 12),
                 hundred=c(8, 12, 15))

# Option 1 - only one outlier per gene 
df_outliers1 <- data.frame(gene=rep(genes, 1), 
                           value = sample(0:1, 3, replace = TRUE))

# Option 2 - more than one outlier per gene
# gene is repeated twice explicitly so it lines up with the six values
df_outliers2 <- data.frame(gene=rep(genes, 2), 
                           value = c(sample(0:1, 3, replace = TRUE), sample(12:16, 3, replace=TRUE)))

# Option 1 - using df_outliers1 - works
ggplot(df, aes(x=gene, ymin=zero, lower=twentyfive, middle=fifty, upper=seventyfive, ymax=hundred)) + 
  geom_boxplot(stat='identity') + 
  geom_point(aes(y=df_outliers1$value)) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

# Option 2 - using df_outliers2 - error (as above)!
ggplot(df, aes(x=gene, ymin=zero, lower=twentyfive, middle=fifty, upper=seventyfive, ymax=hundred)) + 
  geom_boxplot(stat='identity') + 
  geom_point(aes(y=df_outliers2$value)) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))


Curiously, when there is exactly one outlier point per gene (Option 1, using df_outliers1), the approach above works perfectly well. But as soon as there are more points per gene (Option 2, using df_outliers2), the error occurs.
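The difference between the two options comes down to how a bare vector inside aes() is matched to the layer data: by row position, not by gene. A small sketch of this, using a hypothetical data frame `out` whose rows are deliberately stored in reverse gene order; layer_data() returns the values ggplot2 actually draws.

```r
library(ggplot2)

genes <- factor(c("a", "b", "c"))
df <- data.frame(gene = genes)

# Hypothetical outliers, one per gene, stored in reverse gene order:
# c -> 30, b -> 20, a -> 10
out <- data.frame(gene = rev(genes), value = c(30, 20, 10))

# `$` inside aes(): x comes from df, y from out, paired by row position
p_dollar <- ggplot(df, aes(x = gene)) +
  geom_point(aes(y = out$value))

# data = out: x and y are taken from the same rows of `out`
p_data <- ggplot(df, aes(x = gene)) +
  geom_point(aes(y = value), data = out)

d_dollar <- layer_data(p_dollar)
d_data   <- layer_data(p_data)

# y value drawn at gene 'a' (the first position on the x axis)
d_dollar$y[as.numeric(d_dollar$x) == 1]  # the value that belongs to gene 'c'
d_data$y[as.numeric(d_data$x) == 1]      # the value that belongs to gene 'a'
```

So a vector of length 3 runs without an error but can silently attach points to the wrong gene, while a vector of any other length (as in Option 2) cannot be paired with the 3 rows of df at all, which is what the error reports.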

What is the best way to address this problem? (Or is there a better way of tackling the sparse matrix directly?)


Solution

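  • Layers inherit both the data and the aesthetics of the main ggplot() call by default. In Option 2 the point layer still uses df (3 rows), so the 6-element vector df_outliers2$value cannot be matched to it; Option 1 only runs because its vector happens to have length 3. If an aesthetic is not shared by all layers, don't specify it in the main ggplot() call. Also, avoid using "$" in aes() calls; give the layer its own data source via data= instead.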

    Try

    ggplot(df, aes(x=gene)) + 
      geom_boxplot(aes(ymin=zero, lower=twentyfive, middle=fifty, upper=seventyfive, ymax=hundred), stat='identity') + 
      geom_point(aes(y=value), data=df_outliers2) + 
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
    

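A self-contained version of the same idea, as a sketch that can be pasted into a fresh session. It swaps geom_point() for geom_jitter(), which the question mentions as an alternative; the seed and the jitter width are arbitrary choices, not part of the original answer.

```r
library(ggplot2)

set.seed(42)
genes <- factor(c("a", "b", "c"))

df <- data.frame(gene = genes,
                 zero = c(1, 3, 0), twentyfive = c(2, 4, 8), fifty = c(5, 5, 12),
                 seventyfive = c(7, 9, 12), hundred = c(8, 12, 15))

# Two outliers per gene; gene is repeated so every value carries its own label
df_outliers2 <- data.frame(gene  = rep(genes, 2),
                           value = c(sample(0:1, 3, replace = TRUE),
                                     sample(12:16, 3, replace = TRUE)))

p <- ggplot(df, aes(x = gene)) +
  geom_boxplot(aes(ymin = zero, lower = twentyfive, middle = fifty,
                   upper = seventyfive, ymax = hundred),
               stat = "identity") +
  # height = 0 keeps the y values exact; only the x position is jittered
  geom_jitter(aes(y = value), data = df_outliers2, width = 0.1, height = 0)

box_layer     <- layer_data(p, 1)  # one row per gene
outlier_layer <- layer_data(p, 2)  # one row per outlier
```

Because each layer now carries its own data, the outlier data frame can have any number of rows per gene, including none for some genes.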