Is there a smart way to only keep the n largest groups (counts) when creating a boxplot?
library(tidyverse)
head(mpg)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
mpg %>%
  count(manufacturer, sort = TRUE)
# A tibble: 15 x 2
   manufacturer     n
   <chr>        <int>
 1 dodge           37
 2 toyota          34
 3 volkswagen      27
 4 ford            25
 5 chevrolet       19
 6 audi            18
 7 hyundai         14
 8 subaru          14
 9 nissan          13
10 honda            9
11 jeep             8
12 pontiac          5
13 land rover       4
14 mercury          4
15 lincoln          3
Here is a plot. For example, I would like to keep only the first 5 manufacturers from the table above.
mpg %>%
  ggplot() +
  geom_boxplot(aes(displ, manufacturer))
What you need to do is extract the N wanted manufacturers before the ggplot
call and pass them to scale_y_discrete(limits = ...). The limits argument
restricts the axis to the listed values, so only those groups are plotted.
library(tidyverse)
nWanted <- 5
foo <- head(count(mpg, manufacturer, sort = TRUE), nWanted)$manufacturer
# [1] "dodge" "toyota" "volkswagen" "ford" "chevrolet"
ggplot(mpg) +
  geom_boxplot(aes(displ, manufacturer)) +
  scale_y_discrete(limits = foo)
A more correct solution would be to pass the categorical variable to the x axis and then flip the coordinates:
ggplot(mpg) +
  geom_boxplot(aes(manufacturer, displ)) +
  coord_flip() +
  scale_x_discrete(limits = foo)
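As an alternative to setting limits on the scale, you could also subset the data itself before it reaches ggplot. The following is only a minimal sketch of that idea, assuming dplyr and ggplot2 are installed; the names n_wanted, top_manufacturers and mpg_top are made up for illustration and do not come from the code above.

```r
library(dplyr)
library(ggplot2)  # also provides the mpg dataset

n_wanted <- 5  # hypothetical name: how many of the largest groups to keep

# Names of the n_wanted manufacturers with the most rows
top_manufacturers <- mpg %>%
  count(manufacturer, sort = TRUE) %>%
  head(n_wanted) %>%
  pull(manufacturer)

# Keep only the rows that belong to those manufacturers
mpg_top <- mpg %>%
  filter(manufacturer %in% top_manufacturers)

# Plot the reduced data; no limits are needed on the scale
p <- ggplot(mpg_top) +
  geom_boxplot(aes(displ, manufacturer))
p
```

With this variant the other manufacturers are dropped from the data up front, so ggplot2 never sees them. Note that the boxes then appear in the default alphabetical order rather than in order of counts; passing top_manufacturers to limits as in the answer above would restore that order.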