
Fill in missing rows for dates by group


I have a data table like this, just much bigger:

library(data.table)

customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
# ISO-formatted strings are parsed by as.Date() without an explicit format
time <- as.Date(c("2017-01-01", "2017-05-01", "2017-06-01",
                  "2017-02-01", "2017-04-01", "2017-05-01", "2017-06-01",
                  "2017-01-01", "2017-04-01", "2017-05-01"))


tenor <- c(1,2,3,1,2,3,4,1,2,3)
variable_x <- c(87,90,100,120,130,150,12,13,15,14)

my_data <- data.table(customer_id,account_id,time,tenor,variable_x)

customer_id account_id       time tenor variable_x
          1         11 2017-01-01     1         87
          1         11 2017-05-01     2         90
          1         11 2017-06-01     3        100
          2         55 2017-02-01     1        120
          2         55 2017-04-01     2        130
          2         55 2017-05-01     3        150
          2         55 2017-06-01     4         12
          3         38 2017-01-01     1         13
          3         38 2017-04-01     2         15
          3         38 2017-05-01     3         14

For each customer_id, account_id pair I should have monthly observations from 2017-01-01 to 2017-06-01, but for some pairs some of these six dates are missing. I would like to fill in the missing dates so that every customer_id, account_id pair has a row for all six months, with tenor and variable_x set to NA in the added rows. That is, it should look like this:

    customer_id account_id       time tenor variable_x
           1         11    2017-01-01     1         87
           1         11    2017-02-01    NA         NA
           1         11    2017-03-01    NA         NA
           1         11    2017-04-01    NA         NA
           1         11    2017-05-01     2         90
           1         11    2017-06-01     3        100
           2         55    2017-01-01    NA         NA
           2         55    2017-02-01     1        120
           2         55    2017-03-01    NA         NA
           2         55    2017-04-01     2        130
           2         55    2017-05-01     3        150
           2         55    2017-06-01     4         12
           3         38    2017-01-01     1         13
           3         38    2017-02-01    NA         NA
           3         38    2017-03-01    NA         NA
           3         38    2017-04-01     2         15
           3         38    2017-05-01     3         14
           3         38    2017-06-01    NA         NA

I tried creating a sequence of dates from 2017-01-01 to 2017-06-01 by using

ts = seq(as.Date("2017/01/01"), as.Date("2017/06/01"), by = "month")

and then merge it to the original data with

ts = data.table(ts)
colnames(ts) = "time"
merged <- merge(ts, my_data, by="time", all.x=TRUE)

but this does not give the desired result. How can I add rows for the missing dates for each customer_id, account_id pair?


Solution

  • We can do a join. Create the sequence of 'time' from its minimum to its maximum by '1 month', repeat that sequence for each 'customer_id', 'account_id' group, and join the result to the original data on those two columns plus 'time'.

    ts1 <- seq(min(my_data$time), max(my_data$time), by = "1 month")
    my_data[my_data[, .(time = ts1), by = .(customer_id, account_id)],
            on = .(customer_id, account_id, time)]
    #    customer_id account_id       time tenor variable_x
    # 1:           1         11 2017-01-01     1         87
    # 2:           1         11 2017-02-01    NA         NA
    # 3:           1         11 2017-03-01    NA         NA
    # 4:           1         11 2017-04-01    NA         NA
    # 5:           1         11 2017-05-01     2         90
    # 6:           1         11 2017-06-01     3        100
    # 7:           2         55 2017-01-01    NA         NA
    # 8:           2         55 2017-02-01     1        120
    # 9:           2         55 2017-03-01    NA         NA
    #10:           2         55 2017-04-01     2        130
    #11:           2         55 2017-05-01     3        150
    #12:           2         55 2017-06-01     4         12
    #13:           3         38 2017-01-01     1         13
    #14:           3         38 2017-02-01    NA         NA
    #15:           3         38 2017-03-01    NA         NA
    #16:           3         38 2017-04-01     2         15
    #17:           3         38 2017-05-01     3         14
    #18:           3         38 2017-06-01    NA         NA
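
    To see what the one-line join is doing, it can be split into two steps. This is only a sketch on the question's toy data, and the names `grid` and `filled` are made up for illustration. The inner expression builds one row per pair and month; the outer join then looks each of those rows up in `my_data` and leaves NA where there is no match.

    ```r
    library(data.table)

    # Toy data from the question
    my_data <- data.table(
      customer_id = c("1","1","1","2","2","2","2","3","3","3"),
      account_id  = as.character(c(11,11,11,55,55,55,55,38,38,38)),
      time        = as.Date(c("2017-01-01", "2017-05-01", "2017-06-01",
                              "2017-02-01", "2017-04-01", "2017-05-01", "2017-06-01",
                              "2017-01-01", "2017-04-01", "2017-05-01")),
      tenor       = c(1,2,3,1,2,3,4,1,2,3),
      variable_x  = c(87,90,100,120,130,150,12,13,15,14)
    )

    ts1 <- seq(min(my_data$time), max(my_data$time), by = "1 month")

    # Step 1: repeat the full month sequence for every existing pair
    # (`grid` is an illustrative name, not part of the original answer)
    grid <- my_data[, .(time = ts1), by = .(customer_id, account_id)]

    # Step 2: keep every row of `grid` and pull in matching rows of my_data;
    # months with no observation get NA in tenor and variable_x
    filled <- my_data[grid, on = .(customer_id, account_id, time)]

    nrow(grid)                 # 3 pairs x 6 months
    sum(is.na(filled$tenor))   # number of rows that were added
    ```

    Here `grid` has 3 pairs x 6 months = 18 rows, and `filled` keeps those 18 rows, 8 of them with NA in `tenor` and `variable_x`.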
    

    Or using tidyverse

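    library(tidyverse)
    distinct(my_data, customer_id, account_id) %>%
      mutate(time = list(ts1)) %>%
      # tidyr >= 1.0 expects the column to unnest to be named
      unnest(time) %>%
      left_join(my_data, by = c("customer_id", "account_id", "time"))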
    

    Or with complete from tidyr

    my_data %>% 
         complete(nesting(customer_id, account_id), time = ts1)
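
    As a side note on the attempt in the question: merging on time alone can only add months that are absent from the whole table, not months that are absent for one particular pair. A minimal base R sketch of a repaired version follows, assuming a plain data.frame copy of the toy data; `my_df`, `pairs`, `grid` and `merged` are illustrative names only.

    ```r
    # Toy data from the question, as a plain data.frame
    my_df <- data.frame(
      customer_id = c("1","1","1","2","2","2","2","3","3","3"),
      account_id  = as.character(c(11,11,11,55,55,55,55,38,38,38)),
      time        = as.Date(c("2017-01-01", "2017-05-01", "2017-06-01",
                              "2017-02-01", "2017-04-01", "2017-05-01", "2017-06-01",
                              "2017-01-01", "2017-04-01", "2017-05-01")),
      tenor       = c(1,2,3,1,2,3,4,1,2,3),
      variable_x  = c(87,90,100,120,130,150,12,13,15,14)
    )

    ts <- seq(as.Date("2017-01-01"), as.Date("2017-06-01"), by = "month")

    # Only the customer/account pairs that actually occur in the data
    pairs <- unique(my_df[, c("customer_id", "account_id")])

    # With by = NULL, merge() on data.frames returns the Cartesian product:
    # every pair combined with every month
    grid <- merge(pairs, data.frame(time = ts), by = NULL)

    # Left join the observations onto the full grid, matching on all three
    # columns; months without an observation get NA
    merged <- merge(grid, my_df,
                    by = c("customer_id", "account_id", "time"),
                    all.x = TRUE)

    nrow(merged)
    ```

    The key difference from the original attempt is that the join uses customer_id, account_id and time together, against a grid that already contains every pair and month combination.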