I observe the number of purchases of (in the example below: 4) different customers on (five) different days. Now I want to create a new variable summing up the number of purchases of every single user during the last 20 purchases that have been made in total, across users.
Example data:
> da <- data.frame(customer_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
+ day = c("2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15"),
+ n_purchase = c(5,2,8,0,3,2,0,3,4,0,2,4,5,1,0,2,3,5,0,3))
> da
customer_id day n_purchase
1 1 2016-04-11 5
2 1 2016-04-12 2
3 1 2016-04-13 8
4 1 2016-04-14 0
5 1 2016-04-15 3
6 2 2016-04-11 2
7 2 2016-04-12 0
8 2 2016-04-13 3
9 2 2016-04-14 4
10 2 2016-04-15 0
11 3 2016-04-11 2
12 3 2016-04-12 4
13 3 2016-04-13 5
14 3 2016-04-14 1
15 3 2016-04-15 0
16 4 2016-04-11 2
17 4 2016-04-12 3
18 4 2016-04-13 5
19 4 2016-04-14 0
20 4 2016-04-15 3
I need to know three things to construct my variable: (1) What's the overall number of purchases on a day across users (day purchases)? (2) What's the cumulative number of purchases across users starting from the first day (cumsum_day_purchases)? (3) On which day did, originating from the current observation, the 20 immediately precending (across users) purchases start? This is where I have issues with coding such a variable.
> library(dplyr)
> da %>%
+ group_by(day) %>%
+ mutate(day_purchases = sum(n_purchase)) %>%
+ group_by(customer_id) %>%
+ mutate(cumsum_day_purchases = cumsum(day_purchases))
# A tibble: 20 x 5
# Groups: customer_id [4]
customer_id day n_purchase day_purchases cumsum_day_purchases
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 2016-04-11 5 11 11
2 1 2016-04-12 2 9 20
3 1 2016-04-13 8 21 41
4 1 2016-04-14 0 5 46
5 1 2016-04-15 3 6 52
6 2 2016-04-11 2 11 11
7 2 2016-04-12 0 9 20
8 2 2016-04-13 3 21 41
9 2 2016-04-14 4 5 46
10 2 2016-04-15 0 6 52
11 3 2016-04-11 2 11 11
12 3 2016-04-12 4 9 20
13 3 2016-04-13 5 21 41
14 3 2016-04-14 1 5 46
15 3 2016-04-15 0 6 52
16 4 2016-04-11 2 11 11
17 4 2016-04-12 3 9 20
18 4 2016-04-13 5 21 41
19 4 2016-04-14 0 5 46
20 4 2016-04-15 3 6 52
I will now in the following dataset compute the variable I wish to have by hand.
Resulting in the following output:
> da = da %>% ungroup() %>%
+ mutate(cumsum_last_20_purchases = c(5,5+2,8,0,0+3,2,2+0,3,4,4+0,2,2+4,5,1,1+0,2,2+3,5,0,0+3))
> da
# A tibble: 20 x 6
customer_id day n_purchase day_purchases cumsum_day_purchases cumsum_last_20_purchases
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 2016-04-11 5 11 11 5
2 1 2016-04-12 2 9 20 7
3 1 2016-04-13 8 21 41 8
4 1 2016-04-14 0 5 46 0
5 1 2016-04-15 3 6 52 3
6 2 2016-04-11 2 11 11 2
7 2 2016-04-12 0 9 20 2
8 2 2016-04-13 3 21 41 3
9 2 2016-04-14 4 5 46 4
10 2 2016-04-15 0 6 52 4
11 3 2016-04-11 2 11 11 2
12 3 2016-04-12 4 9 20 6
13 3 2016-04-13 5 21 41 5
14 3 2016-04-14 1 5 46 1
15 3 2016-04-15 0 6 52 1
16 4 2016-04-11 2 11 11 2
17 4 2016-04-12 3 9 20 5
18 4 2016-04-13 5 21 41 5
19 4 2016-04-14 0 5 46 0
20 4 2016-04-15 3 6 52 3
We can create a new grouping based on the last day the day_purchase
columns is above 20, and then use cumsum
on that:
library(dplyr)
da %>%
group_by(day) %>%
mutate(day_purchases = sum(n_purchase)) %>%
group_by(customer_id) %>%
mutate(above = with(rle(day_purchases >= 20), rep(1:length(lengths), lengths))) %>%
group_by(above, .add =TRUE) %>%
mutate(cumsum_last_20_purchases = cumsum(n_purchase))
#> # A tibble: 20 x 6
#> # Groups: customer_id, above [12]
#> customer_id day n_purchase day_purchases above cumsum_last_20_purchas…
#> <dbl> <fct> <dbl> <dbl> <int> <dbl>
#> 1 1 2016-04-11 5 11 1 5
#> 2 1 2016-04-12 2 9 1 7
#> 3 1 2016-04-13 8 21 2 8
#> 4 1 2016-04-14 0 5 3 0
#> 5 1 2016-04-15 3 6 3 3
#> 6 2 2016-04-11 2 11 1 2
#> 7 2 2016-04-12 0 9 1 2
#> 8 2 2016-04-13 3 21 2 3
#> 9 2 2016-04-14 4 5 3 4
#> 10 2 2016-04-15 0 6 3 4
#> 11 3 2016-04-11 2 11 1 2
#> 12 3 2016-04-12 4 9 1 6
#> 13 3 2016-04-13 5 21 2 5
#> 14 3 2016-04-14 1 5 3 1
#> 15 3 2016-04-15 0 6 3 1
#> 16 4 2016-04-11 2 11 1 2
#> 17 4 2016-04-12 3 9 1 5
#> 18 4 2016-04-13 5 21 2 5
#> 19 4 2016-04-14 0 5 3 0
#> 20 4 2016-04-15 3 6 3 3
Created on 2020-07-28 by the reprex package (v0.3.0)