Search code examples
rrolling-computationrunnerbatching

how to count a "batch" every 15 minutes


1 2021-01-01 12:59:38
2 2021-01-01 14:08:59
3 2021-01-01 14:09:08
4 2021-01-01 14:11:30
5 2021-01-01 14:22:19
6 2021-01-01 14:41:07

I want to be able to count the number of entries every 15 minutes but on a rolling basis. E.g 12:59 would be 1 within 15 mins, 14:08 => 14:22 would all be within 15 minutes so this would return 4 in this batch and finally 14:41 would be by itself in another 15 minute batch.

I hope this makes sense and thanks in advance.

Apologies for not including this

> dput(df)
structure(list(ClickedDate = structure(c(1609460198.707, 1609462979.593, 
1609465088.437, 1609476270.88, 1609478479.177, 1609479667.373, 
1609493081.887, 1609499187.29, 1609507506.37, 1609510989.533, 
1609511522.023, 1609511894.067, 1609512194.773, 1609512377.227, 
1609514474.153), tzone = "UTC", class = c("POSIXct", "POSIXt"
)), batch_no = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 
12L, 12L, 12L, 13L), batch_size = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L)), row.names = c(NA, -15L), class = c("tbl_df", 
"tbl", "data.frame"))

NEW EDIT - Thank you for working on this. I am getting an error

Error in UseMethod("mutate") : 
  no applicable method for 'mutate' applied to an object of class "c('integer', 'numeric')"

This seems odd, my variable is in class

> class(df$ClickedDate)
[1] "POSIXct" "POSIXt" 

Does this work with mutate, or do I need to convert this?

> dput(df)
structure(list(ClickedDate = structure(c(1609460198.707, 1609462979.593, 
1609465088.437, 1609476270.88, 1609478479.177, 1609479667.373, 
1609493081.887, 1609499187.29, 1609507506.37, 1609510989.533, 
1609511522.023, 1609511894.067, 1609512194.773, 1609512377.227, 
1609514474.153), tzone = "UTC", class = c("POSIXct", "POSIXt"
)), batch_no = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 
12L, 12L, 12L, 13L), batch_size = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L)), row.names = c(NA, -15L), class = c("tbl_df", 
"tbl", "data.frame"))

Thanks in advance


Solution

  • Usage of runner package will help in this scenario. Use the following strategy

    library(tidyverse)
    library(runner)
    
    df %>% mutate(b_len = runner::runner(x = ClickedDate,
                                 idx = ClickedDate,
                                 k = "15 mins",
                                 lag = "-14 mins",
                                 f = length),
                  b_no = purrr::accumulate(seq_len(length(b_len)-1), .init = b_len[1], ~ifelse(.x > .y, .x, .x + b_len[.x +1])),
                  b_no = as.integer(as.factor(b_no))) %>%
      group_by(b_no) %>%
      mutate(b_len = n())
    
    # A tibble: 15 x 3
    # Groups:   b_no [12]
       ClickedDate         b_len  b_no
       <dttm>              <int> <int>
     1 2021-01-01 00:16:38     1     1
     2 2021-01-01 01:02:59     1     2
     3 2021-01-01 01:38:08     1     3
     4 2021-01-01 04:44:30     1     4
     5 2021-01-01 05:21:19     1     5
     6 2021-01-01 05:41:07     1     6
     7 2021-01-01 09:24:41     1     7
     8 2021-01-01 11:06:27     1     8
     9 2021-01-01 13:25:06     1     9
    10 2021-01-01 14:23:09     2    10
    11 2021-01-01 14:32:02     2    10
    12 2021-01-01 14:38:14     3    11
    13 2021-01-01 14:43:14     3    11
    14 2021-01-01 14:46:17     3    11
    15 2021-01-01 15:21:14     1    12
    
    

    Notes -

    • lag argument in runner function allows a backward time window (rolling) so I am using negative lag to use forward time window.
    • k argument in runner is for given length of rolling window
    • b_no column initially identifies the sliding/rolling window upto the earliest window is exhausted and thereafter takes new window.
    • dense_rank can also be used (see alternative below)

    Alternatively

    df %>% mutate(b_len = runner::runner(x = ClickedDate,
                                 idx = ClickedDate,
                                 k = "15 mins",
                                 lag = "-14 mins",
                                 f = length),
                  b_no = purrr::accumulate(seq_len(length(b_len)-1), .init = b_len[1], ~ifelse(.x > .y, .x, .x + b_len[.x +1])),
                  b_no = dense_rank(b_no)) %>%
      group_by(b_no) %>%
      mutate(b_len = n()) %>%
      ungroup()
    # A tibble: 15 x 3
       ClickedDate         b_len  b_no
       <dttm>              <int> <int>
     1 2021-01-01 00:16:38     1     1
     2 2021-01-01 01:02:59     1     2
     3 2021-01-01 01:38:08     1     3
     4 2021-01-01 04:44:30     1     4
     5 2021-01-01 05:21:19     1     5
     6 2021-01-01 05:41:07     1     6
     7 2021-01-01 09:24:41     1     7
     8 2021-01-01 11:06:27     1     8
     9 2021-01-01 13:25:06     1     9
    10 2021-01-01 14:23:09     2    10
    11 2021-01-01 14:32:02     2    10
    12 2021-01-01 14:38:14     3    11
    13 2021-01-01 14:43:14     3    11
    14 2021-01-01 14:46:17     3    11
    15 2021-01-01 15:21:14     1    12
    

    data used

    df
    > df
    # A tibble: 15 x 1
       ClickedDate        
       <dttm>             
     1 2021-01-01 00:16:38
     2 2021-01-01 01:02:59
     3 2021-01-01 01:38:08
     4 2021-01-01 04:44:30
     5 2021-01-01 05:21:19
     6 2021-01-01 05:41:07
     7 2021-01-01 09:24:41
     8 2021-01-01 11:06:27
     9 2021-01-01 13:25:06
    10 2021-01-01 14:23:09
    11 2021-01-01 14:32:02
    12 2021-01-01 14:38:14
    13 2021-01-01 14:43:14
    14 2021-01-01 14:46:17
    15 2021-01-01 15:21:14