Search code examples
rdatetimerolling-computation

Rolling Window based on a fixed time interval


I'm trying to calculate a rolling window in a fixed time interval. Suppose that the interval is 48 hours. I would like to get every data point that is contained between the date of the current observation and 48 hours before that observation. For example, if the datetime of the current observation is 05-07-2022 14:15:28, for that position, I would like a count value for every occurence between that date and 03-07-2022 14:15:28. Seconds are not fundamental to the analysis.

library(tidyverse)
library(lubridate)

df = tibble(id = 1:7,
            date_time = ymd_hm('2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
                               '2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'))


# A tibble: 7 × 2
     id date_time          
  <int> <dttm>             
1     1 2022-05-07 15:00:00
2     2 2022-05-09 13:45:00
3     3 2022-05-09 13:51:00
4     4 2022-05-09 17:00:00
5     5 2022-05-10 15:25:00
6     6 2022-05-10 17:18:00
7     7 2022-05-11 14:00:00

With the example window of 48 hours, that would yield:

# A tibble: 7 × 4
     id date_time           lag_48hours         count
  <int> <dttm>              <dttm>              <dbl>
1     1 2022-05-07 15:00:00 2022-05-05 15:00:00     1
2     2 2022-05-09 13:45:00 2022-05-07 13:45:00     2
3     3 2022-05-09 13:51:00 2022-05-07 13:51:00     3
4     4 2022-05-09 17:00:00 2022-05-07 17:00:00     3
5     5 2022-05-10 15:25:00 2022-05-08 15:25:00     4
6     6 2022-05-10 17:18:00 2022-05-08 17:18:00     5
7     7 2022-05-11 14:00:00 2022-05-09 14:00:00     4

I added the lag column for illustration purposes. Any idea how to obtain the count column? I need to be able to adjust the window (48 hours in this example).


Solution

  • I'd encourage you to use slider, which allows you to do rolling window analysis using an irregular index.

    library(tidyverse)
    library(lubridate)
    library(slider)
    
    df = tibble(
      id = 1:7,
      date_time = ymd_hm(
        '2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
        '2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'
      )
    )
    
    df %>%
      mutate(
        count = slide_index_int(
          .x = id, 
          .i = date_time, 
          .f = length, 
          .before = dhours(48)
        )
      )
    #> # A tibble: 7 × 3
    #>      id date_time           count
    #>   <int> <dttm>              <int>
    #> 1     1 2022-05-07 15:00:00     1
    #> 2     2 2022-05-09 13:45:00     2
    #> 3     3 2022-05-09 13:51:00     3
    #> 4     4 2022-05-09 17:00:00     3
    #> 5     5 2022-05-10 15:25:00     4
    #> 6     6 2022-05-10 17:18:00     5
    #> 7     7 2022-05-11 14:00:00     4