I want to calculate a some descriptive statistics:
customer_id
field)Example of a data frame:
dt <- tibble(
order_id = 1:10,
customer_id = c(1, rep(2, 2), rep(3, 3), rep(4, 4)),
amount = seq(10, 100, by = 10),
date = c("2020-10-07 12:00", # 1st customer
"2020-10-07 12:00", "2020-10-08 11:00", # 2st customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", # 3rd customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", "2020-10-08 21:00") # 4th customer
) %>%
mutate(
date = lubridate::ymd_hm(date)
)
Expected result:
order_id customer_id amount date amount__n__last_24h amount__mean__last_24h
<int> <dbl> <dbl> <dttm> <dbl> <dbl>
1 1 10 2020-10-07 12:00:00 0 NA
2 2 20 2020-10-07 12:00:00 0 NA
3 2 30 2020-10-08 11:00:00 1 20
4 3 40 2020-10-07 12:00:00 0 NA
5 3 50 2020-10-08 11:00:00 1 40
6 3 60 2020-10-08 20:00:00 1 50
7 4 70 2020-10-07 12:00:00 0 NA
8 4 80 2020-10-08 11:00:00 1 70
9 4 90 2020-10-08 20:00:00 1 80
10 4 100 2020-10-08 21:00:00 2 85
How can I do this?
Here is an option using data.table
:
dt[, dayago := date - 24 * 60 * 60]
dt[, c("n", "avg") :=
dt[dt, on=.(customer_id, date>=dayago, date<date),
by=.EACHI, .(n=.N, avg=mean(amount))][, (1L:3L) := NULL]
]
data:
library(data.table)
dt <- data.table(
order_id = 1:10,
customer_id = c(1, rep(2, 2), rep(3, 3), rep(4, 4)),
amount = seq(10, 100, by = 10),
date = as.POSIXct(c("2020-10-07 12:00", # 1st customer
"2020-10-07 12:00", "2020-10-08 11:00", # 2st customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", # 3rd customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", "2020-10-08 21:00" # 4th customer
), format=("%Y-%m-%d %H:%M")))