I am trying to create summary statistics for a set of filters that operate in a lead/lag fashion.
A short description of lead/lag:
When a new filter is put online, it is placed in the lag position, meaning water passes through it after it passes through the primary (aka lead) filter. When the lead filter clogs, the current lag filter is moved into the lead position. To summarize, a filter starts in the lag position and is then bumped into the lead position.
Visually, you can imagine it like this:
What I need to do is sum, over the entire time a single filter was online, its flow in both the lead and lag positions.
Here's sample data:
structure(list(record_timestamp = structure(c(1608192000, 1608192060,1608192120, 1608192180, 1608192240, 1608192300, 1608192360, 1608192420,1608192480, 1608192540, 1608192600, 1608192660, 1608192720, 1608192780,1608192840, 1608192900, 1608192960, 1608193020, 1608193080, 1608193140,1608193200, 1608193260, 1608193320, 1608193380, 1608193440, 1608193500,1608193560, 1608193620, 1608193680, 1608193740, 1608193800), class = c("POSIXct","POSIXt"), tzone = "UTC"), flow = c(20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10), lag_start = structure(c(1608192000,1608192000, 1608192000, 1608192000, 1608192000, 1608192000, 1608192000,1608192000, 1608192000, 1608192000, 1608192000, 1608192660, 1608192660,1608192660, 1608192660, 1608192660, 1608192660, 1608192660, 1608192660,1608192660, 1608192660, 1608193260, 1608193260, 1608193260, 1608193260,1608193260, 1608193260, 1608193260, 1608193260, 1608193260, 1608193260), class = c("POSIXct", "POSIXt"), tzone = "UTC"), lead_start = c("#N/A","#N/A", "#N/A", "#N/A", "#N/A", "#N/A", "#N/A", "#N/A", "#N/A","#N/A", "#N/A", "12/17/2020 8:11", "12/17/2020 8:11", "12/17/2020 8:11","12/17/2020 8:11", "12/17/2020 8:11", "12/17/2020 8:11", "12/17/2020 8:11","12/17/2020 8:11", "12/17/2020 8:11", "12/17/2020 8:11", "12/17/2020 8:21","12/17/2020 8:21", "12/17/2020 8:21", "12/17/2020 8:21", "12/17/2020 8:21","12/17/2020 8:21", "12/17/2020 8:21", "12/17/2020 8:21", "12/17/2020 8:21","12/17/2020 8:21")), class = c("spec_tbl_df", "tbl_df", "tbl","data.frame"), row.names = c(NA, -31L), spec = structure(list(cols = list(record_timestamp = structure(list(), class = c("collector_character","collector")), flow = structure(list(), class = c("collector_double","collector")), polish_start = structure(list(), class = c("collector_character", "collector")), lead_start = structure(list(), class = c("collector_character","collector"))), default = structure(list(), class = 
c("collector_guess","collector")), skip = 1), class = "col_spec"))
My thought is to "unnest" the data and just accept that there will be duplicate timestamps, but that each row will be associated with only one filter. Any thoughts on how to accomplish this? The unnested DF would look like:
structure(list(record_timestamp = structure(c(1608192000, 1608192060,1608192120, 1608192180, 1608192240, 1608192300, 1608192360, 1608192420,1608192480, 1608192540, 1608192600, 1608192660, 1608192720, 1608192780,1608192840, 1608192900, 1608192960, 1608193020, 1608193080, 1608193140,1608193200, 1608192660, 1608192720, 1608192780, 1608192840, 1608192900,1608192960, 1608193020, 1608193080, 1608193140, 1608193200, 1608193260,1608193320, 1608193380, 1608193440, 1608193500, 1608193560,1608193620,1608193680, 1608193740, 1608193800), class = c("POSIXct", "POSIXt"), tzone = "UTC"), flow = c(20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), lag_start = structure(c(1608192000, 1608192000, 1608192000,1608192000, 1608192000, 1608192000, 1608192000, 1608192000,1608192000,1608192000, 1608192000, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1608192660, 1608192660, 1608192660, 1608192660, 1608192660, 1608192660,1608192660, 1608192660, 1608192660, 1608192660, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), lead_start = structure(c(NA, NA, NA, NA, NA, NA, NA, NA,NA, NA, NA, 1608192660, 1608192660, 1608192660, 1608192660,1608192660, 1608192660, 1608192660, 1608192660, 1608192660,1608192660, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1608193260,1608193260, 1608193260, 1608193260, 1608193260, 1608193260,1608193260, 1608193260, 1608193260, 1608193260), class = c("POSIXct","POSIXt"), tzone = "UTC"), filter_id = c(1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -41L), spec = structure(list(cols = list(record_timestamp = structure(list(), class = c("collector_character","collector")), flow = structure(list(), class = c("collector_double","collector")), polish_start = structure(list(), class = c("collector_character","collector")), lead_start = structure(list(), class = c("collector_character", "collector")), filter_id = structure(list(), class = c("collector_double","collector"))), default = structure(list(), class = c("collector_guess","collector")), skip = 1), class = "col_spec"))
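One way that unnested frame could be built is to stack two copies of the data: each row attributed once to the filter currently in the lag position, and once to the filter currently in the lead position (which is the previous lag filter). A minimal sketch, assuming dplyr is available and using an abbreviated stand-in for the sample data:

```r
library(dplyr)

# Abbreviated stand-in for the sample data: three lag periods
# (flows 20, 15, 10), three one-minute rows each.
df <- tibble(
  record_timestamp = as.POSIXct(60 * 0:8, origin = "1970-01-01", tz = "UTC"),
  flow      = c(20, 20, 20, 15, 15, 15, 10, 10, 10),
  lag_start = as.POSIXct(60 * c(0, 0, 0, 3, 3, 3, 6, 6, 6),
                         origin = "1970-01-01", tz = "UTC")
)

unnested <- bind_rows(
  # each row attributed to the filter currently in the lag position
  # (filter i is the i-th distinct lag_start)
  df %>% mutate(filter_id = dense_rank(lag_start)),
  # ...and attributed again to the filter currently in the lead position,
  # i.e. the previous lag filter (no lead filter exists in the first period)
  df %>%
    mutate(filter_id = dense_rank(lag_start) - 1) %>%
    filter(filter_id >= 1)
)

# Summing per filter then gives the lag-plus-lead total for each filter.
unnested %>%
  group_by(filter_id) %>%
  summarise(total_flow = sum(flow))
# filter 1: 60 + 45 = 105; filter 2: 45 + 30 = 75; filter 3: 30
```

This reproduces the doubled-row layout described above, at the cost of duplicating every timestamp that has both a lead and a lag filter online.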
I realize, however, that this will double the size of the data I'm working with, which is already several years of one-minute data. So if there is a way to do this without doubling up timestamps, that would be preferred.
Finally, the end goal is to have a small summary DF that looks like:
Filter ID | Total Flow
----------|-----------
        1 |        370
        2 |        250
      ... |        ...
Thanks for providing additional information. It seems that you could group_by the lag_start time alone, then calculate the total flow while in that position (either lead or lag). After that, you can assign filter numbers in sequence, and the total flow for each filter is the sum of the flow in the current row and the following row (its lag period plus its lead period). Does this give the desired result?
library(dplyr)

df %>%
  group_by(lag_start) %>%
  summarise(flow_per_position = sum(flow)) %>%
  mutate(filter_id = row_number(),
         total_filter_flow = flow_per_position +
           lead(flow_per_position, default = 0))
Output

  lag_start           flow_per_position filter_id total_filter_flow
  <dttm>                          <dbl>     <int>             <dbl>
1 2020-12-17 08:00:00               220         1               370
2 2020-12-17 08:11:00               150         2               250
3 2020-12-17 08:21:00               100         3               100
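If only the two-column summary from the question is wanted, a select() at the end of the same pipe drops the helper columns. A minimal, self-contained sketch (using an abbreviated stand-in for the sample data, three lag periods with flows 20/15/10):

```r
library(dplyr)

# Abbreviated stand-in for the sample data.
df <- tibble(
  record_timestamp = as.POSIXct(60 * 0:8, origin = "1970-01-01", tz = "UTC"),
  flow      = c(20, 20, 20, 15, 15, 15, 10, 10, 10),
  lag_start = as.POSIXct(60 * c(0, 0, 0, 3, 3, 3, 6, 6, 6),
                         origin = "1970-01-01", tz = "UTC")
)

summary_df <- df %>%
  group_by(lag_start) %>%                              # one group per lag period
  summarise(flow_per_position = sum(flow)) %>%          # flow while in that position
  mutate(filter_id = row_number(),                      # filters come online in order
         total_filter_flow = flow_per_position +
           lead(flow_per_position, default = 0)) %>%    # lag period + lead period
  select(filter_id, total_filter_flow)
# filter 1: 60 + 45 = 105; filter 2: 45 + 30 = 75; filter 3: 30
```

Because this works on one row per lag period rather than one row per timestamp, it avoids doubling the underlying minute-level data.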