Search code examples
rdplyrpaneldata-cleaningcumsum

Select precending X observations for operation in grouped data


I observe the number of purchases of different customers on a few days (1 = purchase, 0 = no purchase). Now for each day, I want to sum up the number of purchases of a specific customer during the preceding 2 days and the current day, so for 3 days in total.

Example data:

da <- data.frame(customer_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
                 day = c("2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15"),
                 purchase = c(1,1,1,0,1,0,0,1,1,0,1,1,1,1,0,1,0,1,0,1))

> da
   customer_id        day purchase
1            1 2016-04-11        1
2            1 2016-04-12        1
3            1 2016-04-13        1
4            1 2016-04-14        0
5            1 2016-04-15        1
6            2 2016-04-11        0
7            2 2016-04-12        0
8            2 2016-04-13        1
9            2 2016-04-14        1
10           2 2016-04-15        0
11           3 2016-04-11        1
12           3 2016-04-12        1
13           3 2016-04-13        1
14           3 2016-04-14        1
15           3 2016-04-15        0
16           4 2016-04-11        1
17           4 2016-04-12        0
18           4 2016-04-13        1
19           4 2016-04-14        0
20           4 2016-04-15        1

The output I am looking for:

output_da <- data.frame(customer_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
                 day = c("2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15"),
                 purchase = c(1,1,1,0,1,0,0,1,1,0,1,1,1,1,0,1,0,1,0,1),
                 purchases_last_3_days = c(1,2,3,2,2,0,0,1,2,2,1,2,3,3,2,1,1,2,1,2))

output_da
   customer_id        day purchase purchases_last_3_days
1            1 2016-04-11        1                     1
2            1 2016-04-12        1                     2
3            1 2016-04-13        1                     3
4            1 2016-04-14        0                     2
5            1 2016-04-15        1                     2
6            2 2016-04-11        0                     0
7            2 2016-04-12        0                     0
8            2 2016-04-13        1                     1
9            2 2016-04-14        1                     2
10           2 2016-04-15        0                     2
11           3 2016-04-11        1                     1
12           3 2016-04-12        1                     2
13           3 2016-04-13        1                     3
14           3 2016-04-14        1                     3
15           3 2016-04-15        0                     2
16           4 2016-04-11        1                     1
17           4 2016-04-12        0                     1
18           4 2016-04-13        1                     2
19           4 2016-04-14        0                     1
20           4 2016-04-15        1                     2

I know about the cumsum function, but I don't know how to select the X number of rows precending every current day.


Solution

  • If you need only the last 3 rows, you can do it manually:

    library(dplyr)
    output_da <- da %>% 
        group_by(customer_id) %>% 
        mutate(pday = lag(purchase, default = 0), 
               apday = lag(purchase, n = 2, default = 0), 
               purchases_last_3_days = rowSums(across(c(purchase, pday, apday))),
               pday = NULL, apday = NULL)
    

    UPDATE

    And if you need more than the last 3 rows, it is possible to automatize a bit, changing the 3 by the suitable number:

    library(dplyr)
    library(purrr)
    last_day_rows <- 3
    da %>% 
        group_by(customer_id) %>% 
        mutate(purchases_last_3_days = 
               rowSums(map_dfc(c(1:last_day_rows), 
               ~lag(purchase, n = .x - 1, default = 0))))