Search code examples
rdplyrsumpipecumsum

Sum up ending with the current observation starting based on a criteria


I observe the number of purchases of (in the example below: 4) different customers on (five) different days. Now I want to create a new variable summing up the number of purchases of every single user during the last 20 purchases that have been made in total, across users.

Example data:

> da <- data.frame(customer_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
+                  day = c("2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15"),
+                  n_purchase = c(5,2,8,0,3,2,0,3,4,0,2,4,5,1,0,2,3,5,0,3))
> da
   customer_id        day n_purchase
1            1 2016-04-11          5
2            1 2016-04-12          2
3            1 2016-04-13          8
4            1 2016-04-14          0
5            1 2016-04-15          3
6            2 2016-04-11          2
7            2 2016-04-12          0
8            2 2016-04-13          3
9            2 2016-04-14          4
10           2 2016-04-15          0
11           3 2016-04-11          2
12           3 2016-04-12          4
13           3 2016-04-13          5
14           3 2016-04-14          1
15           3 2016-04-15          0
16           4 2016-04-11          2
17           4 2016-04-12          3
18           4 2016-04-13          5
19           4 2016-04-14          0
20           4 2016-04-15          3

I need to know three things to construct my variable: (1) What's the overall number of purchases on a day across users (day purchases)? (2) What's the cumulative number of purchases across users starting from the first day (cumsum_day_purchases)? (3) On which day did, originating from the current observation, the 20 immediately precending (across users) purchases start? This is where I have issues with coding such a variable.

> library(dplyr)
> da %>% 
+   group_by(day) %>% 
+   mutate(day_purchases = sum(n_purchase)) %>% 
+   group_by(customer_id) %>%
+   mutate(cumsum_day_purchases = cumsum(day_purchases))
# A tibble: 20 x 5
# Groups:   customer_id [4]
   customer_id day        n_purchase day_purchases cumsum_day_purchases
         <dbl> <fct>           <dbl>         <dbl>                <dbl>
 1           1 2016-04-11          5            11                   11
 2           1 2016-04-12          2             9                   20
 3           1 2016-04-13          8            21                   41
 4           1 2016-04-14          0             5                   46
 5           1 2016-04-15          3             6                   52
 6           2 2016-04-11          2            11                   11
 7           2 2016-04-12          0             9                   20
 8           2 2016-04-13          3            21                   41
 9           2 2016-04-14          4             5                   46
10           2 2016-04-15          0             6                   52
11           3 2016-04-11          2            11                   11
12           3 2016-04-12          4             9                   20
13           3 2016-04-13          5            21                   41
14           3 2016-04-14          1             5                   46
15           3 2016-04-15          0             6                   52
16           4 2016-04-11          2            11                   11
17           4 2016-04-12          3             9                   20
18           4 2016-04-13          5            21                   41
19           4 2016-04-14          0             5                   46
20           4 2016-04-15          3             6                   52

I will now in the following dataset compute the variable I wish to have by hand.

  • For all observations on day 2016-04-12 , I compute the cumulative sum of purchases of a specific customer by adding the number of purchases of the current day and the precending day, because in total all customers together made 20 purchases on the current day and the precending day.
  • For day 2016-04-13, I only use the number of purchases of a user on this day, because there have been 21 (41-20) new purchases on the day itself

Resulting in the following output:

> da = da %>% ungroup() %>%
+   mutate(cumsum_last_20_purchases = c(5,5+2,8,0,0+3,2,2+0,3,4,4+0,2,2+4,5,1,1+0,2,2+3,5,0,0+3))
> da
# A tibble: 20 x 6
   customer_id day        n_purchase day_purchases cumsum_day_purchases cumsum_last_20_purchases
         <dbl> <fct>           <dbl>         <dbl>                <dbl>                    <dbl>
 1           1 2016-04-11          5            11                   11                        5
 2           1 2016-04-12          2             9                   20                        7
 3           1 2016-04-13          8            21                   41                        8
 4           1 2016-04-14          0             5                   46                        0
 5           1 2016-04-15          3             6                   52                        3
 6           2 2016-04-11          2            11                   11                        2
 7           2 2016-04-12          0             9                   20                        2
 8           2 2016-04-13          3            21                   41                        3
 9           2 2016-04-14          4             5                   46                        4
10           2 2016-04-15          0             6                   52                        4
11           3 2016-04-11          2            11                   11                        2
12           3 2016-04-12          4             9                   20                        6
13           3 2016-04-13          5            21                   41                        5
14           3 2016-04-14          1             5                   46                        1
15           3 2016-04-15          0             6                   52                        1
16           4 2016-04-11          2            11                   11                        2
17           4 2016-04-12          3             9                   20                        5
18           4 2016-04-13          5            21                   41                        5
19           4 2016-04-14          0             5                   46                        0
20           4 2016-04-15          3             6                   52                        3

Solution

  • We can create a new grouping based on the last day the day_purchase columns is above 20, and then use cumsum on that:

    library(dplyr)
    
    da %>% 
      group_by(day) %>% 
      mutate(day_purchases = sum(n_purchase)) %>% 
      group_by(customer_id) %>% 
      mutate(above = with(rle(day_purchases >= 20), rep(1:length(lengths), lengths))) %>% 
      group_by(above, .add  =TRUE) %>% 
      mutate(cumsum_last_20_purchases = cumsum(n_purchase))
    #> # A tibble: 20 x 6
    #> # Groups:   customer_id, above [12]
    #>    customer_id day        n_purchase day_purchases above cumsum_last_20_purchas…
    #>          <dbl> <fct>           <dbl>         <dbl> <int>                   <dbl>
    #>  1           1 2016-04-11          5            11     1                       5
    #>  2           1 2016-04-12          2             9     1                       7
    #>  3           1 2016-04-13          8            21     2                       8
    #>  4           1 2016-04-14          0             5     3                       0
    #>  5           1 2016-04-15          3             6     3                       3
    #>  6           2 2016-04-11          2            11     1                       2
    #>  7           2 2016-04-12          0             9     1                       2
    #>  8           2 2016-04-13          3            21     2                       3
    #>  9           2 2016-04-14          4             5     3                       4
    #> 10           2 2016-04-15          0             6     3                       4
    #> 11           3 2016-04-11          2            11     1                       2
    #> 12           3 2016-04-12          4             9     1                       6
    #> 13           3 2016-04-13          5            21     2                       5
    #> 14           3 2016-04-14          1             5     3                       1
    #> 15           3 2016-04-15          0             6     3                       1
    #> 16           4 2016-04-11          2            11     1                       2
    #> 17           4 2016-04-12          3             9     1                       5
    #> 18           4 2016-04-13          5            21     2                       5
    #> 19           4 2016-04-14          0             5     3                       0
    #> 20           4 2016-04-15          3             6     3                       3
    

    Created on 2020-07-28 by the reprex package (v0.3.0)