I want to use dplyr
to determine which observations in a dataframe meet the following condition:
Group
, the combined total of Var2
for observations where Var1 == good
is greater than the combined total of observations whereVar1 == bad
Here's the toy dataframe:
library(dplyr)
set.seed(seed = 10)
df <- data.frame("Id" = 1:12,
"Group" = paste(sapply(toupper(letters[1:3]), rep, times = 4,simplify = T)),
"Var1" = sample(rep(c("good","bad"),times = 1000),size = 12),
"Var2" = sample(rep(1:10, times = 1000),size = 12))
print(df)
Id Group Var1 Var2
1 1 A good 6
2 2 A bad 9
3 3 A good 10
4 4 A good 7
5 5 B bad 9
6 6 B bad 1
7 7 B bad 6
8 8 B good 6
9 9 C good 1
10 10 C bad 8
11 11 C good 4
12 12 C bad 2
So far I've determined that I should be using some combination of group_by()
,summarise()
, and filter()
but I can't seem to wrap my head around a good way to do it. Here's what I've come up with so far:
keepers <- df %>%
group_by(Group, Var1) %>%
summarise(Total = sum(Var2)) %>%
print()
Source: local data frame [6 x 3]
Groups: Group [?]
Group Var1 Total
(chr) (chr) (int)
1 A bad 9
2 A good 23
3 B bad 16
4 B good 6
5 C bad 10
6 C good 5
What next steps should I take? Ultimately the analysis should return "A", because it's the only Group
where Total
is greater for the good
observations than for the bad
observations.
How about using spread
than filter
:
> library(tidyr)
> df %>% group_by(Group, Var1) %>%
+ summarise(Total = sum(Var2)) %>%
+ spread(Var1,Total) %>%
+ filter(good>bad)
Source: local data frame [1 x 3]
Group bad good
1 A 9 23