Search code examples
rdplyrgroup-bysummarize

R sum using group_by, select groups with large sum, then repeat sum using different group_by


I have a working solution to this problem, but I'm curious if there are ways to do this more efficiently than I did.

I have a dataframe that looks like below:

df <- data.frame("Factor1" = c(rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4)),
                 "Factor2" = c(rep(c(rep("a", 2), rep("b", 2)), 4)),
                 "Value" = c(c(1:4), c(21:24), c(11:14), c(41:44)))

I first summed all Value for each Factor1 to identify which Factor1 has the largest sum of Value, then select the top n (2 in this example) sum of Value like below:

df2 <- df %>% group_by(Factor1) %>%
           summarise(Factor1.sum = sum(Value)) %>%
           arrange(desc(Factor1.sum)) %>%
           top_n(n=2) %>% ungroup()

Now that I identified B and D as the top two Factor1 with the largest sum of Value, I would again like to calculate the sum of Value, but this time grouping by both Factor1 and Factor2, like below.

df3 <- subset(df, Factor1 %in% df2$Factor1) %>%
          group_by(Factor1, Factor2) %>%
          summarise(Factor2.sum = sum(Value))

This gives me just what I want, but I need to create extra files (df2 and df3), and I'm wondering if there's some function in the dplyr package (or really any package) that I'm unaware of that will make this process simpler?


Solution

  • Here's a shorter variation of the same thing, using count as a shortcut for group_by & summarize, and left_join to take the result of the first step as the list of Factor1's for the second part.

    count(df, Factor1, wt = Value, sort = TRUE) %>% 
      slice(1:2) %>%
      left_join(count(df, Factor1, Factor2, wt = Value, name = "Factor2.sum"))
    
    
    Joining, by = "Factor1"
      Factor1   n Factor2 Factor2.sum
    1       D 170       a          83
    2       D 170       b          87
    3       B  90       a          43
    4       B  90       b          47