Search code examples
rtime-seriesxtssliding-window

Sliding window on date-time field in data frame R


I want to calculate a some descriptive statistics:

  • by specific customer (in example, customer_id field)
  • for a sliding time window (in example, 24h)
  • excluding the current value (using only previous values in period)

Example of a data frame:

dt <- tibble(
    order_id = 1:10,
    customer_id = c(1, rep(2, 2), rep(3, 3), rep(4, 4)),
    amount = seq(10, 100, by = 10),
    date = c("2020-10-07 12:00", # 1st customer
             "2020-10-07 12:00", "2020-10-08 11:00", # 2st customer
             "2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", # 3rd customer
             "2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", "2020-10-08 21:00") # 4th customer
  ) %>% 
  mutate(
    date = lubridate::ymd_hm(date)
  )

Expected result:

order_id customer_id amount date                amount__n__last_24h amount__mean__last_24h
<int>       <dbl>  <dbl> <dttm>                            <dbl>                  <dbl>
1           1     10 2020-10-07 12:00:00                   0                     NA
2           2     20 2020-10-07 12:00:00                   0                     NA
3           2     30 2020-10-08 11:00:00                   1                     20
4           3     40 2020-10-07 12:00:00                   0                     NA
5           3     50 2020-10-08 11:00:00                   1                     40
6           3     60 2020-10-08 20:00:00                   1                     50
7           4     70 2020-10-07 12:00:00                   0                     NA
8           4     80 2020-10-08 11:00:00                   1                     70
9           4     90 2020-10-08 20:00:00                   1                     80
10          4    100 2020-10-08 21:00:00                   2                     85

How can I do this?


Solution

  • Here is an option using data.table:

    dt[, dayago := date - 24 * 60 * 60]
    dt[, c("n", "avg") :=
        dt[dt, on=.(customer_id, date>=dayago, date<date),
            by=.EACHI, .(n=.N, avg=mean(amount))][, (1L:3L) := NULL]
        ]
    

    data:

    library(data.table)
    dt <- data.table(
        order_id = 1:10,
        customer_id = c(1, rep(2, 2), rep(3, 3), rep(4, 4)),
        amount = seq(10, 100, by = 10),
        date = as.POSIXct(c("2020-10-07 12:00", # 1st customer
            "2020-10-07 12:00", "2020-10-08 11:00", # 2st customer
            "2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", # 3rd customer
            "2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", "2020-10-08 21:00" # 4th customer
    ), format=("%Y-%m-%d %H:%M")))