Counting events in a group_by manner (R)


Here is my code:

library(dplyr)

set.seed(23)
data_toy <- tibble(
  family_code = sample(factor(400:410), 1000, replace = TRUE),
  event_type = factor(sample(c("sad", "happy"), 1000,
                replace = TRUE, prob = c(.2, .8))),
  score = sample(1:100, 1000, replace = TRUE)
) %>%
  # happy events carry no score
  mutate(score = if_else(event_type == "happy", NA_integer_, score)) %>%
  arrange(family_code)

Output:

   family_code event_type score
   <fct>       <fct>      <int>
 1 400         happy         NA
 2 400         happy         NA
 3 400         happy         NA
 4 400         happy         NA
 5 400         sad           57
 6 400         happy         NA
 7 400         happy         NA
 8 400         happy         NA
 9 400         happy         NA
10 400         sad           65

I would like to create a feature that, within each family, counts the number of happy events preceding each sad event, with the count resetting after every sad event.

In the example I shared, my desired output would be:

   family_code event_type score happy_counter
   <fct>       <fct>      <int>         <dbl>
 1 400         happy         NA            NA
 2 400         happy         NA            NA
 3 400         happy         NA            NA
 4 400         happy         NA            NA
 5 400         sad           57             4
 6 400         happy         NA            NA
 7 400         happy         NA            NA
 8 400         happy         NA            NA
 9 400         happy         NA            NA
10 400         sad           65             4
11 400         happy         NA            NA
12 400         happy         NA            NA
13 400         happy         NA            NA
14 400         happy         NA            NA
15 400         happy         NA            NA
16 400         happy         NA            NA
17 400         happy         NA            NA
18 400         happy         NA            NA
19 400         sad           79             8
20 400         sad           78             0

My data has approx. 10k observations. I tried group_by and nest_by but struggled with zeroing the count after each sad event.


Solution

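  • Try the following, which uses consecutive_id() (available from dplyr 1.1.0) to label runs of identical consecutive event types: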

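    library(dplyr)  # consecutive_id() needs dplyr >= 1.1.0
    out <- data_toy %>%
       # label each run of identical consecutive event types
       group_by(family_code, ind = consecutive_id(event_type)) %>% 
       mutate(n = n()) %>%              # run length
       slice_head(n = 1) %>%            # one row per run
       group_by(family_code) %>%
       # for a sad run, take the length of the preceding run; NA for happy runs
       mutate(n = lag(n) * NA^(event_type == "happy")) %>%
       ungroup() %>%
       select(ind, family_code, event_type, happy_counter = n) %>%
       # attach the per-run count back to every original row
       left_join(data_toy %>% 
         mutate(ind = consecutive_id(event_type)), .,
         by = c("family_code", "event_type", "ind")) %>% 
       group_by(family_code, ind) %>% 
       # keep the count on the first sad row of a run; later sad rows get 0
       mutate(happy_counter = happy_counter * (all(event_type == "sad") & 
         !duplicated(happy_counter))) %>%
       ungroup()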
    

    -output

    head(out, 20)
    # A tibble: 20 × 5
       family_code event_type score   ind happy_counter
       <fct>       <fct>      <int> <int>         <dbl>
     1 400         happy         NA     1            NA
     2 400         happy         NA     1            NA
     3 400         happy         NA     1            NA
     4 400         happy         NA     1            NA
     5 400         sad           57     2             4
     6 400         happy         NA     3            NA
     7 400         happy         NA     3            NA
     8 400         happy         NA     3            NA
     9 400         happy         NA     3            NA
    10 400         sad           65     4             4
    11 400         happy         NA     5            NA
    12 400         happy         NA     5            NA
    13 400         happy         NA     5            NA
    14 400         happy         NA     5            NA
    15 400         happy         NA     5            NA
    16 400         happy         NA     5            NA
    17 400         happy         NA     5            NA
    18 400         happy         NA     5            NA
    19 400         sad           79     6             8
    20 400         sad           78     6             0
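
A shorter alternative, offered only as a sketch and not as part of the answer above: start a new run on the row after each sad event, then report the number of happy rows in that run on its sad row. The names `toy`, `run_id` and `counted` are illustrative assumptions, and `toy` is a hand-typed copy of the 20 rows shown in the question, not the output of the `set.seed(23)` code. Note one assumed behaviour: if a family's first event is sad, this sketch reports 0 for it, which may differ from the code above.

```r
library(dplyr)

# Hand-typed copy of the first 20 rows from the question (family 400 only).
toy <- tibble(
  family_code = factor(rep(400, 20)),
  event_type  = factor(c(rep("happy", 4), "sad",
                         rep("happy", 4), "sad",
                         rep("happy", 8), "sad", "sad")),
  score = c(rep(NA, 4), 57L, rep(NA, 4), 65L, rep(NA, 8), 79L, 78L)
)

counted <- toy %>%
  group_by(family_code) %>%
  # run_id increases by one on the row that follows each sad event
  mutate(run_id = cumsum(lag(event_type == "sad", default = FALSE))) %>%
  group_by(family_code, run_id) %>%
  # each run ends with at most one sad row; put the run's happy count there
  mutate(happy_counter = if_else(event_type == "sad",
                                 sum(event_type == "happy"),
                                 NA_integer_)) %>%
  ungroup() %>%
  select(-run_id)

print(counted, n = 20)
```

Because a run never contains more than one sad row, two sad events in a row give the second one a count of 0 without any extra handling. Here `happy_counter` comes out as an integer column, not a double as in the desired output; wrap the `sum()` in `as.numeric()` and use `NA_real_` if that matters.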