r ggplot2 dplyr forcats

Keep n largest groups in geom_boxplot


Is there a smart way to only keep the n largest groups (counts) when creating a boxplot?

library(tidyverse)

head(mpg)

# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

mpg %>% 
  count(manufacturer, sort=TRUE)

# A tibble: 15 x 2
   manufacturer     n
   <chr>        <int>
 1 dodge           37
 2 toyota          34
 3 volkswagen      27
 4 ford            25
 5 chevrolet       19
 6 audi            18
 7 hyundai         14
 8 subaru          14
 9 nissan          13
10 honda            9
11 jeep             8
12 pontiac          5
13 land rover       4
14 mercury          4
15 lincoln          3

Here is the plot. I would like to keep, for example, only the top 5 manufacturers from the table above.

mpg %>% ggplot()+
  geom_boxplot(aes(displ, manufacturer))



Solution

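  • Extract the N manufacturers you want before the ggplot call and pass them to scale_y_discrete(limits = ...). The limits argument restricts the axis to the listed values, so only those groups are plotted; rows for the other manufacturers are dropped from the plot (ggplot2 may warn about removed rows).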

    library(tidyverse)
    
    nWanted <- 5
    foo <- head(count(mpg, manufacturer, sort = TRUE), nWanted)$manufacturer
    # [1] "dodge"      "toyota"     "volkswagen" "ford"       "chevrolet"     
    
    ggplot(mpg) +
        geom_boxplot(aes(displ, manufacturer)) +
        scale_y_discrete(limits = foo)
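
    Because limits leaves the data untouched and only hides the other groups, an alternative is to filter the rows before plotting. The following is a minimal sketch of that idea, not part of the original answer; the names n_wanted, top_manufacturers and mpg_top are made up for illustration.

    ```r
    library(dplyr)
    library(ggplot2)

    n_wanted <- 5  # hypothetical name: number of groups to keep

    # Names of the n_wanted manufacturers with the most rows
    top_manufacturers <- mpg %>%
      count(manufacturer, sort = TRUE) %>%
      head(n_wanted) %>%
      pull(manufacturer)

    # Keep only the rows that belong to those manufacturers
    mpg_top <- mpg %>%
      filter(manufacturer %in% top_manufacturers)

    # Plot the filtered data; no scale limits are needed
    p <- ggplot(mpg_top) +
      geom_boxplot(aes(displ, manufacturer))
    print(p)
    ```

    With the counts shown in the question, mpg_top should hold 142 rows (37 + 34 + 27 + 25 + 19). Since manufacturer is still a character column here, the boxes appear in alphabetical order rather than in order of count.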
    


    A more conventional solution is to map the categorical variable to the x axis and then flip the coordinates:

    ggplot(mpg) +
        geom_boxplot(aes(manufacturer, displ)) +
        coord_flip() +
        scale_x_discrete(limits = foo)
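
    Since the question is also tagged forcats, here is a hedged sketch of how the same result might be reached with factor helpers. It is not part of the original answer. It assumes a forcats version that provides fct_lump_n() (0.5.0 or later), and the names n_wanted and mpg_lumped are hypothetical.

    ```r
    library(dplyr)
    library(forcats)
    library(ggplot2)

    n_wanted <- 5  # hypothetical name: number of groups to keep

    mpg_lumped <- mpg %>%
      # Keep the n_wanted most frequent manufacturers; all others become "Other"
      mutate(manufacturer = fct_lump_n(manufacturer, n = n_wanted)) %>%
      # Drop the lumped rows so only the largest groups remain
      filter(manufacturer != "Other") %>%
      # Remove the unused "Other" level, order levels by count,
      # then reverse so the largest group is drawn at the top of the y axis
      mutate(manufacturer = manufacturer %>% fct_drop() %>% fct_infreq() %>% fct_rev())

    p <- ggplot(mpg_lumped) +
      geom_boxplot(aes(displ, manufacturer))
    print(p)
    ```

    Unlike the limits approach, this orders the boxes by group size as a side effect. Note that fct_lump_n() can keep more than n levels when counts are tied at the cutoff; in mpg the fifth and sixth manufacturers have 19 and 18 rows, so no tie arises here.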