Search code examples
rdplyrtidyrspread

Dplyr filtering based on two variables


I want to use dplyr to determine which observations in a dataframe meet the following condition:

  • Within each Group, the combined total of Var2 for observations where Var1 == good is greater than the combined total of observations whereVar1 == bad

Here's the toy dataframe:

library(dplyr)

set.seed(seed = 10)

df <- data.frame("Id" = 1:12,
                 "Group" = paste(sapply(toupper(letters[1:3]), rep, times = 4,simplify = T)),
                 "Var1" = sample(rep(c("good","bad"),times = 1000),size = 12),
                 "Var2" = sample(rep(1:10, times = 1000),size = 12))

print(df)

   Id Group Var1 Var2
1   1     A good    6
2   2     A  bad    9
3   3     A good   10
4   4     A good    7
5   5     B  bad    9
6   6     B  bad    1
7   7     B  bad    6
8   8     B good    6
9   9     C good    1
10 10     C  bad    8
11 11     C good    4
12 12     C  bad    2

So far I've determined that I should be using some combination of group_by(),summarise(), and filter() but I can't seem to wrap my head around a good way to do it. Here's what I've come up with so far:

keepers <- df %>% 
        group_by(Group, Var1) %>%
        summarise(Total = sum(Var2)) %>% 
        print()

Source: local data frame [6 x 3]
Groups: Group [?]

  Group  Var1 Total
  (chr) (chr) (int)
1     A   bad     9
2     A  good    23
3     B   bad    16
4     B  good     6
5     C   bad    10
6     C  good     5

What next steps should I take? Ultimately the analysis should return "A", because it's the only Group where Total is greater for the good observations than for the bad observations.


Solution

  • How about using spread than filter:

    > library(tidyr)
    > df %>% group_by(Group, Var1) %>%
    +    summarise(Total = sum(Var2)) %>%
    +    spread(Var1,Total) %>%
    +    filter(good>bad)
    Source: local data frame [1 x 3]
    
      Group bad good
    1     A   9   23