Search code examples
rdatedplyr

Calculate date variables for time series analysis using a dynamic approach


In the following example, I'd like to calculate the variables Date_replaced and Date_current_until from Date for each customer and product.

Date_replaced reflects the timepoint at which a product bought by a certain customer was replaced by the purchase of another sort of product. Date_current_until reflects the timepoint until a product was the current product for each customer.

library(dplyr)

Customer_ID <- c(rep(1,4), rep(2,5))
Date <- c(seq(as.Date("2024-02-01"), as.Date("2024-02-04"), "days"), seq(as.Date("2024-02-01"), as.Date("2024-02-05"), "days")) 
Goods <- c(rep("Food",3), "Toys", "Newspaper", rep("Food",3), "Toys")
Date_first_purchased <- c(rep("2024-02-01", 3), "2024-02-04", "2024-02-01", rep("2024-02-02", 3), "2024-02-05") %>% as.Date (., format="%Y-%m-%d") 
Date_replaced <- c(rep("2024-02-04", 3), NA, "2024-02-02", rep("2024-02-05", 3), NA) %>% as.Date (., format="%Y-%m-%d") 
Date_current_until <- c(rep("2024-02-04", 4), "2024-02-02", rep("2024-02-05", 4)) %>% as.Date (., format="%Y-%m-%d")

df <- data.frame(Customer_ID, Date, Goods, Date_first_purchased, Date_replaced, Date_current_until)

So far, I've tried the usual things when working on time series data, for instance, grouping data by ID and arranging them by Date within each ID. I also tried using lead () in combination with mutate but was not successful as it seems that lag() and lead() only support fixed numbers to find a previous or subsequent entry. However, this problem obviously requires a dynamic solution which I was not able to find yet. Maybe someone of you has tackled a similar problem previously. Any advice on how to do this is appreciated.


Solution

  • library(tidyverse)
    
    df %>%  
      mutate(Date_replaced = if_else(lead(Goods) != Goods,
                                     lead(Date), 
                                     max(Date[Goods == Goods])), 
             Date_current_until = if_else(row_number() == n(), 
                                          Date, 
                                          if_else(lead(Goods) != Goods,  
                                                  lead(Date), 
                                                  max(Date[Goods == Goods]))), 
             .by = Customer_ID) 
    
    # A tibble: 9 × 6
      Customer_ID Date       Goods     Date_first_purchased Date_replaced Date_current_until
            <dbl> <date>     <chr>     <date>               <date>        <date>            
    1           1 2024-02-01 Food      2024-02-01           2024-02-04    2024-02-04        
    2           1 2024-02-02 Food      2024-02-01           2024-02-04    2024-02-04        
    3           1 2024-02-03 Food      2024-02-01           2024-02-04    2024-02-04        
    4           1 2024-02-04 Toys      2024-02-04           NA            2024-02-04        
    5           2 2024-02-01 Newspaper 2024-02-01           2024-02-02    2024-02-02        
    6           2 2024-02-02 Food      2024-02-02           2024-02-05    2024-02-05        
    7           2 2024-02-03 Food      2024-02-02           2024-02-05    2024-02-05        
    8           2 2024-02-04 Food      2024-02-02           2024-02-05    2024-02-05        
    9           2 2024-02-05 Toys      2024-02-05           NA            2024-02-05