Search code examples
rapache-sparkpysparkdplyrsparklyr

Getting counts of membership in combination of groups using sparklyr or dplyr


I have a spark dataframe I'm manipulating using sparklyr that looks like the following:

input_data <- data.frame(id = c(10,10,10,20,20,30,30,40,40,40,50,60,70, 80,80,80,100,100,110,110,120,120,120,130,140,150,160,170), 
           date = c("2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-05"), 
           group = c("A", "B", "C", "B", "C", "A", "C", "A", "A", "A", "C", "A","B","A", "B", "C", "B", "C", "A", "C", "A", "A", "A", "C", "A", "A", "B","A"), 
           event = c(1,1,1,0,1,0,1,0,0,1,1,1,0,1,1,1,0,1,0,1,0,0,1,1,1,1,1,0))

enter image description here

I'd like to aggregate the data such that I have a count of the number of "events" (where event == 1 ) and "non_events" (where event == 0) for each combination such that the final output looks like the following:

data.frame(group_a = c(1,0,0,1,0,1), 
           group_b = c(0,1,0,1,1,0), 
           group_c = c(0,0,1,0,1,1), 
           event_occured = c(3,1,2,0,2,2), 
           event_not_occured = c(4,2,2,0,2,2))

enter image description here

So, for example, there were no combinations where A and B were groups for the same ID so that gets a 0 for event and non_event. There were 4 IDs in which group A was involved in, of which 3 resulted in an event and 1 resulted in a non_event, so on and so forth.

What approach using sparklyr (or dplyr or pyspark) would allow for aggregation as described above? I tried the following but I'm getting the exact same number of event as event_not_occurred so I must be doing something wrong but can't pinpoint it:

combo_path_sdf <- input_data %>%
  group_by(id) %>%
  arrange(date) %>%
  mutate(order_seq = ifelse(event > 0, 1, NA)) %>%
  mutate(order_seq = lag(cumsum(ifelse(is.na(order_seq), 0, order_seq)))) %>%
  mutate(order_seq = ifelse((row_number() == 1) & (event > 0), -1, ifelse(row_number() == 1, 0, order_seq))) %>% 
  ungroup()

    combo_path_sdf %>%
      group_by(id, order_seq) %>%
      summarize(group_a = max(ifelse(group_a == "A", 1, 0)),
                group_b = max(ifelse(group_b == "B", 1, 0)),
                group_c = max(ifelse(group_c == "C", 1, 0)),
                events = sum(event)) %>%
      group_by(order_seq, group_a, group_b, group_c) %>% 
      summarize(event = sum(events),
                total_sequences = n()) %>%
      mutate(event_not_occured = total_sequences - event)

Final output in the following format would be ok too:

data.frame(group_a = c("A", "B", "C", "A,B", "B,C", "A,C"), 
           event_occured = c(3,1,2,1,2,2), 
           event_not_occured = c(4,2,2,1,2,2))

(image below for A,B is incorrect, should be 1,1 not 0,0) enter image description here


Solution

  • The following matches your requested output format, and process the data in the way I understand you want, but (as per the comment by @Martin Gal) does not match the example result you provided.

    input_data %>%
      group_by(id) %>%
      summarise(group_a = max(ifelse(group == 'A', 1, 0)),
                group_b = max(ifelse(group == 'B', 1, 0)),
                group_c = max(ifelse(group == 'C', 1, 0)),
                event_occured = sum(ifelse(event == 1, 1, 0)),
                event_not_occured = sum(ifelse(event == 0, 1, 0)),
                .groups = "drop") %>%
      group_by(group_a, group_b, group_c) %>%
      summarise(event_occured = sum(event_occured),
                event_not_occured = sum(event_not_occured),
                .groups = "drop")
    

    This idea is a two step summary process. The first summarise creates an indicator for group from each event and counts the number of events/non-events. The second summarise, combines all similar groups.

    Regarding the code you are using that produces the same number of events and non-events. Take a look at hts_combined. This is not defined in the code you have shared and hence your script might be reading a variable from elsewhere.