I have a working solution to this problem, but I'm curious if there are ways to do this more efficiently than I did.
I have a dataframe that looks like below:
df <- data.frame("Factor1" = c(rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4)),
"Factor2" = c(rep(c(rep("a", 2), rep("b", 2)), 4)),
"Value" = c(c(1:4), c(21:24), c(11:14), c(41:44)))
I first summed all Value
for each Factor1
to identify which Factor1
has the largest sum of Value
, then select the top n
(2 in this example) sum of Value
like below:
df2 <- df %>% group_by(Factor1) %>%
summarise(Factor1.sum = sum(Value)) %>%
arrange(desc(Factor1.sum)) %>%
top_n(n=2) %>% ungroup()
Now that I identified B
and D
as the top two Factor1
with the largest sum of Value
, I would again like to calculate the sum of Value
, but this time grouping by both Factor1
and Factor2
, like below.
df3 <- subset(df, Factor1 %in% df2$Factor1) %>%
group_by(Factor1, Factor2) %>%
summarise(Factor2.sum = sum(Value))
This gives me just what I want, but I need to create extra files (df2
and df3
), and I'm wondering if there's some function in the dplyr
package (or really any package) that I'm unaware of that will make this process simpler?
Here's a shorter variation of the same thing, using count
as a shortcut for group_by
& summarize
, and left_join
to take the result of the first step as the list of Factor1's for the second part.
count(df, Factor1, wt = Value, sort = TRUE) %>%
slice(1:2) %>%
left_join(count(df, Factor1, Factor2, wt = Value, name = "Factor2.sum"))
Joining, by = "Factor1"
Factor1 n Factor2 Factor2.sum
1 D 170 a 83
2 D 170 b 87
3 B 90 a 43
4 B 90 b 47