Search code examples
rdataframedplyrlag

Calculate time from the last purchase among the same product purchase


Problem statement:

You are given with the following data:

  • List of customer_id
  • List of product
  • Purchase time
  • Total same product purchase to date

To find:

  • Time_from_last_purchase among the same product

Expected output (the last column):

customer_id product purchase_time  total_to_date  time_from_last_purchase
1              A      2014-11-24         1            0
1              A      2018-02-21         2            1185
1              E      2014-01-08         1            0
2              J      2016-04-18         1            0
3              F      2017-06-12         1            0 
3              G      2017-06-23         1            0 
4              F      2017-09-27         1            0
4              F      2018-01-08         2            103
4              F      2018-02-08         3            31
4              F      2018-02-09         4            1 
4              F      2018-04-10         5            60

My approach:

  • If I do it manually, any customer buy a specific product for the first time, then the time_from_last_purchase is 0.
  • Any customer buy a product from the second time, then time_from_last_purchase will be equal to time_purchase of the current purchase - time_purchase of the previous purchase

I am quite new to R, so any help is greatly appreciated. Thank you!


Solution

  • With dplyr, you can try:

    df %>%
     group_by(customer_id, product) %>%
     mutate(purchase_time = as.Date(purchase_time, format = "%Y-%m-%d"),
            res = purchase_time - lag(purchase_time, default = first(purchase_time)))
    
       customer_id product purchase_time total_to_date res      
             <int> <chr>   <date>                <int> <time>   
     1           1 A       2014-11-24                1    0 days
     2           1 A       2018-02-21                2 1185 days
     3           1 E       2014-01-08                1    0 days
     4           2 J       2016-04-18                1    0 days
     5           3 F       2017-06-12                1    0 days
     6           3 G       2017-06-23                1    0 days
     7           4 F       2017-09-27                1    0 days
     8           4 F       2018-01-08                2  103 days
     9           4 F       2018-02-08                3   31 days
    10           4 F       2018-02-09                4    1 days
    11           4 F       2018-04-10                5   60 days
    

    Or if you need the results as a numeric variable:

    df %>%
     group_by(customer_id, product) %>%
     mutate(purchase_time = as.Date(purchase_time, format = "%Y-%m-%d"),
            res = as.numeric(purchase_time - lag(purchase_time, default = first(purchase_time))))
    
       customer_id product purchase_time total_to_date   res
             <int> <chr>   <date>                <int> <dbl>
     1           1 A       2014-11-24                1     0
     2           1 A       2018-02-21                2  1185
     3           1 E       2014-01-08                1     0
     4           2 J       2016-04-18                1     0
     5           3 F       2017-06-12                1     0
     6           3 G       2017-06-23                1     0
     7           4 F       2017-09-27                1     0
     8           4 F       2018-01-08                2   103
     9           4 F       2018-02-08                3    31
    10           4 F       2018-02-09                4     1
    11           4 F       2018-04-10                5    60