Search code examples
rdplyrsummarize

Conditional summarize of groups in dplyr based on date


I'm an R noob, and trying to perform a summary on a dataset which totals the number of event types for each ID that occurred between events of type 'B' for that ID. Here is a sample to illustrate:

id <- c('1', '1', '1', '2', '2', '2', '3', '3', '3', '3')
type <- c('A', 'A', 'B', 'A', 'B', 'C', 'A', 'B', 'C', 'B')
datestamp <- as.Date(c('2016-06-20','2016-07-16','2016-08-14','2016-07-17'
                       ,'2016-07-18','2016-07-19','2016-07-16','2016-07-19'
                       , '2016-07-21','2016-08-20'))
df <- data.frame(id, type, datestamp)

which produces:

> df
   id type  datestamp
1   1    A 2016-06-20
2   1    A 2016-07-16
3   1    B 2016-08-14
4   2    A 2016-07-17
5   2    B 2016-07-18
6   2    C 2016-07-19
7   3    A 2016-07-16
8   3    B 2016-07-19
9   3    C 2016-07-21
10  3    B 2016-08-20

Any time an event 'B' occurs, I want to know the number of each event type that occurred before that B event, but after any other B events for that ID. What I would like to end up with is a table like this:

  id type B_instance count
1  1    A          1     2
2  2    A          1     1
3  3    A          1     1
4  3    C          2     1

In researching, this question came the closest: summarizing a field based on the value of another field in dplyr

I've been trying to make this work:

  df2 <- df %>%
  group_by(id, type) %>%
  summarize(count = count(id[which(datestamp < datestamp[type =='B'])])) %>%
  filter(type != 'B')

But it errors out (also, even if it worked, it doesn't account for 2 'B' events in the same ID, like with id=3)


Solution

  • You can use cumsum to create the new group variable B_instance by doing cumsum(type == "B") and then filter out types that fall behind the last B as well as type B itself, as they will not be counted. Then use count to count the occurrence with the group by id, B_instance and type.

    df %>% 
           group_by(id) %>% 
           # create B_instance using cumsum on the type == "B" condition
           mutate(B_instance = cumsum(type == "B") + 1) %>%    
           # filter out rows with type behind the last B and all B types                 
           filter(B_instance < max(B_instance), type != "B") %>% 
           # count the occurrences of type grouped by id and B_instance
           count(id, type, B_instance) 
    
    # Source: local data frame [4 x 4]
    # Groups: id, type [?]
    
    #       id   type B_instance     n
    #   <fctr> <fctr>      <dbl> <int>
    # 1      1      A          1     2
    # 2      2      A          1     1
    # 3      3      A          1     1
    # 4      3      C          2     1