I have a Spark DataFrame I'm manipulating using sparklyr that looks like the following:
input_data <- data.frame(
  id = c(10,10,10,20,20,30,30,40,40,40,50,60,70,80,80,80,100,100,110,110,120,120,120,130,140,150,160,170),
  date = c("2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-05"),
  group = c("A","B","C","B","C","A","C","A","A","A","C","A","B","A","B","C","B","C","A","C","A","A","A","C","A","A","B","A"),
  event = c(1,1,1,0,1,0,1,0,0,1,1,1,0,1,1,1,0,1,0,1,0,0,1,1,1,1,1,0))
I'd like to aggregate the data so that I have a count of the number of "events" (where event == 1) and "non_events" (where event == 0) for each combination of groups an ID was involved in, such that the final output looks like the following:
data.frame(group_a = c(1,0,0,1,0,1),
           group_b = c(0,1,0,1,1,0),
           group_c = c(0,0,1,0,1,1),
           event_occured = c(3,1,2,0,2,2),
           event_not_occured = c(4,2,2,0,2,2))
So, for example, there were no IDs where A and B were the groups involved, so that combination gets a 0 for both event and non_event. There were 4 IDs in which group A was involved, of which 3 resulted in an event and 1 in a non_event, and so on.
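To make "combination" concrete: by combination I mean the distinct set of groups attached to a single ID. Locally, in plain dplyr, those per-ID sets can be illustrated like this (illustration only; the groups/events/non_events names are mine, and string aggregation like this won't necessarily translate to Spark SQL):

library(dplyr)

input_data %>%
  group_by(id) %>%
  summarise(groups = paste(sort(unique(group)), collapse = ","),  # e.g. "A,B,C"
            events = sum(event == 1),
            non_events = sum(event == 0),
            .groups = "drop")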
What approach using sparklyr (or dplyr or pyspark) would allow for aggregation as described above? I tried the following, but I get exactly the same number for event as for event_not_occured, so I must be doing something wrong but can't pinpoint it:
combo_path_sdf <- input_data %>%
  group_by(id) %>%
  arrange(date) %>%
  mutate(order_seq = ifelse(event > 0, 1, NA)) %>%
  mutate(order_seq = lag(cumsum(ifelse(is.na(order_seq), 0, order_seq)))) %>%
  mutate(order_seq = ifelse((row_number() == 1) & (event > 0), -1,
                            ifelse(row_number() == 1, 0, order_seq))) %>%
  ungroup()
hts_combined %>%
  group_by(id, order_seq) %>%
  summarize(group_a = max(ifelse(group_a == "A", 1, 0)),
            group_b = max(ifelse(group_b == "B", 1, 0)),
            group_c = max(ifelse(group_c == "C", 1, 0)),
            events = sum(event)) %>%
  group_by(order_seq, group_a, group_b, group_c) %>%
  summarize(event = sum(events),
            total_sequences = n()) %>%
  mutate(event_not_occured = total_sequences - event)
Final output in the following format would be ok too:
data.frame(group_a = c("A", "B", "C", "A,B", "B,C", "A,C"),
event_occured = c(3,1,2,1,2,2),
event_not_occured = c(4,2,2,1,2,2))
The following matches your requested output format and processes the data in the way I understand you want, but (as per the comment by @Martin Gal) it does not match the example result you provided.
input_data %>%
  group_by(id) %>%
  summarise(group_a = max(ifelse(group == 'A', 1, 0)),
            group_b = max(ifelse(group == 'B', 1, 0)),
            group_c = max(ifelse(group == 'C', 1, 0)),
            event_occured = sum(ifelse(event == 1, 1, 0)),
            event_not_occured = sum(ifelse(event == 0, 1, 0)),
            .groups = "drop") %>%
  group_by(group_a, group_b, group_c) %>%
  summarise(event_occured = sum(event_occured),
            event_not_occured = sum(event_not_occured),
            .groups = "drop")
The idea is a two-step summary. The first summarise collapses each ID to a set of group indicators and counts that ID's events and non-events. The second summarise then combines all IDs that share the same combination of group indicators.
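Since dplyr verbs are translated to Spark SQL, the same pipeline should run unchanged against a Spark DataFrame. A minimal sparklyr sketch, assuming a local connection (sc and input_sdf are illustrative names; on older dbplyr/sparklyr versions you may need ungroup() in place of the .groups argument):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")    # illustrative local connection; use your own config
input_sdf <- copy_to(sc, input_data, overwrite = TRUE)

input_sdf %>%
  group_by(id) %>%
  summarise(group_a = max(ifelse(group == 'A', 1, 0)),
            group_b = max(ifelse(group == 'B', 1, 0)),
            group_c = max(ifelse(group == 'C', 1, 0)),
            event_occured = sum(ifelse(event == 1, 1, 0)),
            event_not_occured = sum(ifelse(event == 0, 1, 0)),
            .groups = "drop") %>%
  group_by(group_a, group_b, group_c) %>%
  summarise(event_occured = sum(event_occured),
            event_not_occured = sum(event_not_occured),
            .groups = "drop") %>%
  collect()                              # pull the small aggregated result back into R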
Regarding the code of yours that produces the same number of events and non-events: take a look at hts_combined. It is not defined in the code you have shared, so your script might be reading a variable from somewhere else in your environment.
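A quick way to confirm this in your session is to check where that name resolves (base R sketch; hts_combined is the name from your script):

exists("hts_combined")        # TRUE if the name resolves anywhere on the search path
utils::find("hts_combined")   # lists the environment(s) the name would be read from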