How to subset with ggplot based on facet aggregates?

I have a dataset with 16 groups available for facets. That is too many, so I'd like to keep only the most important groups, determined by what percentage of a certain total falls in each group. For example, I'd like to keep only groups that represent 30% or more of the total of Var1.

To illustrate, if I run the following code, R correctly outputs the two species whose Petal.Length sum represents at least 30% (after rounding) of the total Petal.Length in the dataset (ignore that it's a meaningless statistic in this case).

library(tidyverse)

iris %>% 
  group_by(Species) %>% 
  summarise(t_length = sum(Petal.Length),
            p_length = round(100*t_length/sum(.$Petal.Length))) %>% 
  filter(p_length >=30)

So, what I'd like to do is have ggplot facet by all groups that meet the specified condition. In my dataset, just 5 of the 16 groups capture over 90% of the interesting observations, so I don't need the other 11 groups in the facet grid.

This is my attempt, and the output is all 3 species, where it should only be the same 2 from the table above:

iris.sub <- ggplot(subset(iris, round(100*sum(Petal.Length)/sum(iris$Petal.Length)) >= 30), aes(x = ' ', y = Petal.Length)) +
  geom_point(stat = 'summary', fun.y = 'mean') +
  geom_errorbar(stat = 'summary', fun.data = 'mean_se', 
                width=0, fun.args = list(mult = 1.96)) +
  facet_grid( . ~ Species ) +
  theme_bw()
iris.sub

Solution

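filter with a plain row-wise condition isn't affected by group_by. For example, if you have a data frame grouped by a column var1 and you filter for rows with x > 50, the group an observation belongs to doesn't change whether its value is greater than 50. Your attempt has a different problem: inside subset(), sum(Petal.Length) is computed over the whole data frame rather than per species, so the condition is a single TRUE (100 >= 30) and every row is kept. You need to work out which groups pass the threshold first, and then keep only their rows.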
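The condition in the question's subset() call can be evaluated on its own to see why all three species survive. This is a minimal sketch in base R; `cond` and `kept` are illustrative names, not part of the original code.

```r
# The condition from the question's subset() call, evaluated by itself.
# sum(Petal.Length) runs over all 150 rows at once, so the numerator
# and the denominator are the same number.
cond <- with(iris, round(100 * sum(Petal.Length) / sum(iris$Petal.Length)) >= 30)

length(cond)  # one logical value, not one per row or per species
cond          # TRUE, because 100 >= 30

# A length-one TRUE is recycled across all rows, so nothing is dropped
kept <- subset(iris, cond)
nrow(kept) == nrow(iris)
```

Because the test never varies by species, no amount of faceting afterwards can remove a group.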

Here are two ways to do it with some dplyr functions. The first calculates the share each group contributes to the total petal length, pulls out those species, and keeps that as a vector. Then you filter the data frame for just observations with one of those species, and plot.

The second does those calculations and the plotting all in one block. The advantage of this is that you don't have to save a variable for the species you're keeping. The disadvantage is that doing summary math in mutate calls rather than summarise is messy and can lead to errors if you're not careful about exactly what you're adding up (I'm saying this from experience).

library(tidyverse)

# Species whose share of the total Petal.Length is at least 30%
major_categories <- iris %>%
  group_by(Species) %>%
  summarise(group_Petal.Length = sum(Petal.Length)) %>%
  mutate(share_Petal.Length = group_Petal.Length / sum(group_Petal.Length)) %>%
  filter(share_Petal.Length >= 0.3) %>%
  pull(Species)

# Keep only rows from those species, then plot the mean with a 1.96 * SE bar
iris %>%
  filter(Species %in% major_categories) %>%
  ggplot(aes(x = 1, y = Petal.Length)) +
    # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
    geom_point(stat = "summary", fun = "mean") +
    geom_errorbar(stat = "summary", fun.data = "mean_se", width = 0, fun.args = list(mult = 1.96)) +
    facet_grid(. ~ Species) +
    theme_bw()

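# Same result in one pipeline: attach each row's group total, then its share
iris %>%
  group_by(Species) %>%
  mutate(group_Petal.Length = sum(Petal.Length)) %>%
  ungroup() %>%
  # After ungroup(), sum(Petal.Length) is the overall total. This avoids
  # sum(unique(...)), which undercounts if two groups have the same total.
  mutate(share_Petal.Length = group_Petal.Length / sum(Petal.Length)) %>%
  filter(share_Petal.Length >= 0.3) %>%
  ggplot(aes(x = 1, y = Petal.Length)) +
    # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
    geom_point(stat = "summary", fun = "mean") +
    geom_errorbar(stat = "summary", fun.data = "mean_se", width = 0, fun.args = list(mult = 1.96)) +
    facet_grid(. ~ Species) +
    theme_bw()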
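As a quick sanity check on which species should survive the 30% cut, the shares can be recomputed in base R without any pipeline. This is only a sketch; `totals` and `shares` are illustrative names.

```r
# Total petal length per species, then each species' share of the overall total
totals <- tapply(iris$Petal.Length, iris$Species, sum)
shares <- totals / sum(totals)

round(shares, 3)

# Species at or above the 30% threshold
names(shares)[shares >= 0.3]
# "versicolor" "virginica"
```

If the facets in either plot differ from this list, the share calculation in the pipeline is the first place to look.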

One more note: if you don't have any real values on the x-axis (here it's just a dummy value), you might as well skip the faceting and put Species on the x-axis. I'm not sure whether that will still apply to your larger dataset.
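A rough sketch of that x-axis variant, assuming ggplot2 3.3.0 or later (for the `fun` argument) and the same 30% rule as above. The names `group_total` and `p` are illustrative only.

```r
library(dplyr)
library(ggplot2)

# Species at or above 30% of the total petal length
major_categories <- iris %>%
  group_by(Species) %>%
  summarise(group_total = sum(Petal.Length)) %>%
  filter(group_total / sum(group_total) >= 0.3) %>%
  pull(Species)

# Species goes on the x-axis, so no facets are needed
p <- iris %>%
  filter(Species %in% major_categories) %>%
  ggplot(aes(x = Species, y = Petal.Length)) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun.data = "mean_se", fun.args = list(mult = 1.96),
               geom = "errorbar", width = 0) +
  theme_bw()
p
```

The discrete x scale drops unused factor levels by default, so the filtered-out species should not leave an empty slot on the axis.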