R: Insert missing dates in longitudinal data without losing information


I have a longitudinal dataset in a data table, similar to the simplified example below:

> head(data)
   Country     ID        Date         Value
1:   AT        AT6306    2012-11-01   16.2
2:   AT        AT6306    2012-11-02   12.2
3:   AT        AT6306    2012-11-03   11.3
4:   AT        AT6306    2012-11-04   14.2
5:   AT        AT6306    2012-11-05   17.3
6:   AT        AT6306    2012-11-06   12.5

> tail(data)
   Country     ID        Date         Value
1:   SE        SE0935    2014-06-25   16.2
2:   SE        SE0935    2014-06-26   12.2
3:   SE        SE0935    2014-06-27   11.3
4:   SE        SE0935    2014-06-28   14.2
5:   SE        SE0935    2014-06-29   17.3
6:   SE        SE0935    2014-06-30   12.5

ID is the panel variable; it is unique across countries, with no overlaps. The unique values of Date span 2012-10-23 to 2014-09-30, but the date range is not the same for every ID, and there may be gaps within an ID's range. To obtain a balanced panel, I want to fill in the gaps in my dataset.
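To make the imbalance concrete, here is a minimal sketch that summarises coverage per ID. The toy table and the names `coverage`, `first_date`, `last_date`, `n_obs`, `n_expect` and `n_gaps` are illustrative assumptions, not part of my real data.

```r
library(data.table)

# Toy panel (hypothetical values): ID "A1" has a gap on 2012-11-03,
# and ID "B2" starts two days later than "A1".
data <- data.table(
  Country = c("AT", "AT", "AT", "SE", "SE"),
  ID      = c("A1", "A1", "A1", "B2", "B2"),
  Date    = as.Date(c("2012-11-01", "2012-11-02", "2012-11-04",
                      "2012-11-03", "2012-11-04")),
  Value   = c(16.2, 12.2, 14.2, 11.3, 17.3)
)

# Per-ID coverage: first and last date, observed rows, and the number of
# days a gap-free series over that ID's own range would contain.
coverage <- data[, .(first_date = min(Date),
                     last_date  = max(Date),
                     n_obs      = .N,
                     n_expect   = as.integer(max(Date) - min(Date)) + 1L),
                 by = ID]
coverage[, n_gaps := n_expect - n_obs]
print(coverage)
```

IDs whose `first_date` or `last_date` differ from the overall range, or whose `n_gaps` is above zero, are the ones that need rows added.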

Adapting the answer here, as suggested by @akron, I do the following:

data2 <- data[, CJ(ID=unique(ID), Date=unique(Date))]
setkey(data2, ID, Date)

data.new <- merge(data, data2, by=c("ID", "Date"), all.y = TRUE)
setkey(data.new, ID, Date)

With all.y = TRUE, merge() adds a row for every date missing from data. However, in rows that did not previously exist in data, every field other than ID and Date is NA. That is, my data looks something like this:

> head(data.new)
   Country     ID        Date         Value
1:   NA        AT6306    2012-10-23   NA
2:   NA        AT6306    2012-10-24   NA
3:   NA        AT6306    2012-10-25   NA
4:   NA        AT6306    2012-10-26   NA
5:   NA        AT6306    2012-10-27   NA
6:   NA        AT6306    2012-10-28   NA    

I do want the Value to be NA, since it is missing. However, since the Country does not change for a given ID, I would like the field to be filled.
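One way the desired result could be reached from the merged table is a group-wise fill, sketched below under the assumption that every ID has at least one row with a non-missing Country. The toy `data.new` is a hypothetical stand-in for the real merged table.

```r
library(data.table)

# Toy stand-in for data.new (hypothetical values): rows added by the
# merge have NA in both Country and Value.
data.new <- data.table(
  Country = c(NA, "AT", "AT", NA, "SE"),
  ID      = c("A1", "A1", "A1", "B2", "B2"),
  Date    = as.Date(c("2012-11-01", "2012-11-02", "2012-11-03",
                      "2012-11-02", "2012-11-03")),
  Value   = c(NA, 16.2, 12.2, NA, 11.3)
)

# Within each ID, overwrite Country with its first non-missing value.
# Value is left untouched, so the added rows keep Value = NA.
data.new[, Country := Country[!is.na(Country)][1L], by = ID]
print(data.new)
```

If an ID had no non-missing Country at all, this sketch would leave its Country as NA.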


Solution

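    library(data.table)
    DT <- data.table(dat)
    # Key on the columns used in the cross join below
    setkey(DT, Date, Country, ID)
    # Join DT to every combination of day, Country and ID
    res <- DT[CJ(seq(min(Date), max(Date), by='1 day'), 
                 unique(Country), unique(ID))]
    
    head(res)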
    #    Country   ID       Date Value
    #1:      AT  935 2012-11-01    NA
    #2:      AT 6306 2012-11-01  16.2
    #3:      SE  935 2012-11-01    NA
    #4:      SE 6306 2012-11-01    NA
    #5:      AT  935 2012-11-02    NA
    #6:      AT 6306 2012-11-02  12.2
    

    Update

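    The cross join above pairs every Country with every ID, which produces combinations that do not occur in the data (AT with 935, SE with 6306). One option is to paste Country and ID into a single key column and cross join on that instead: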

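    DT <- data.table(dat)
    # Combine Country and ID so that only existing pairs are crossed with Date
    DT[, CountryID := paste(Country, ID)]
    setkey(DT, Date, CountryID)
    # Cross join, rebuild Country and ID from the key, then drop the helper column
    DT1 <- DT[CJ(unique(Date), unique(CountryID))][,
          c('Country', 'ID') := list(gsub("[ 0-9]", "", CountryID),
                   as.integer(gsub("[^0-9]", "", CountryID)))][,
          CountryID := NULL]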
    
    
    head(DT1,3)
    #     Country    ID       Date Value
    #1:      AT  6306 2012-11-01  16.2
    #2:      SE   935 2012-11-01    NA
    #3:      AT  6306 2012-11-02  12.2
    
    nrow(DT1)
    #[1] 24
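As an alternative to pasting and splitting a combined key, the same idea can be sketched with a cross join on ID and Date only, followed by an update join that restores Country from a small lookup table. This is a sketch under the assumption that Country is constant within ID; the toy data and the names `lookup` and `grid` are illustrative.

```r
library(data.table)

# Toy input (hypothetical values); column names follow the question.
DT <- data.table(
  Country = c("AT", "AT", "SE", "SE"),
  ID      = c(6306L, 6306L, 935L, 935L),
  Date    = as.Date(c("2012-11-01", "2012-11-02",
                      "2012-11-02", "2012-11-03")),
  Value   = c(16.2, 12.2, 11.3, 14.2)
)

# One row per ID with its (assumed constant) Country.
lookup <- unique(DT[, .(ID, Country)])

# Full ID x Date grid, then a join that leaves Value as NA wherever no
# observation exists for that ID and Date.
grid <- CJ(ID = unique(DT$ID), Date = unique(DT$Date))
res  <- DT[grid, on = .(ID, Date)]

# Update join: overwrite Country in res from the lookup table, by ID.
res[lookup, on = "ID", Country := i.Country]
print(res)
```

Because the grid is built from ID alone, no Country and ID pairs are invented, and no string parsing is needed to recover the original columns.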
    

    data

    dat <- structure(list(Country = c("AT", "AT", "AT", "AT", "AT", "AT", 
    "SE", "SE", "SE", "SE", "SE", "SE"), ID = c(6306L, 6306L, 6306L, 
    6306L, 6306L, 6306L, 935L, 935L, 935L, 935L, 935L, 935L), Date = structure(c(15645, 
    15646, 15647, 15648, 15649, 15650, 15669, 15670, 15671, 15672, 
    15673, 15674), class = "Date"), Value = c(16.2, 12.2, 11.3, 14.2, 
    17.3, 12.5, 16.2, 12.2, 11.3, 14.2, 17.3, 12.5)), .Names = c("Country", 
    "ID", "Date", "Value"), row.names = c("1:", "2:", "3:", "4:", 
    "5:", "6:", "1:1", "2:1", "3:1", "4:1", "5:1", "6:1"), class = "data.frame")