Search code examples
rdata.tablemissing-data

merge two sets of observations to fill out missing data - R


So I have a dataset that contains duplicate entries, which have the same ID1 but different ID2, and some of the data is missing. I want to generate a complete data set by moving the missing observations onto a single of the entities and then dropping the others.

Some example test data to illustrate the issue:

library(data.table)
    
tab = data.table(ID1 = rep("A", 9),
               ID2 = c(rep("AA", 3), rep("AB", 3), rep("AC", 3)),
               year = c(2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012),
               var1 = c(100, NA, 150, 100, 120, NA, NA, NA, NA),
               var2 = c(NA, NA, NA, 4, 5, NA, NA, 5, 2),
               var3 = c(4.1, 3.1, 2.7, NA, 3.1, NA, 4.1, NA, NA))

I have already ranked the entities by the amount of missing data the have, but I need a way to move the missing data across to fill in the gaps. For example: move the missing var1 and var2 data from AB to AA.

I have up to 5 duplicates and my data is panel (time x entities).

Many thanks


Solution

  • If you are trying to move data by year, and the data are ordered as above, you can use nafill with type="nocb":

    v = c("var1", "var2", "var3")
    tab[, (v):=lapply(.SD, nafill, type="nocb"), .(ID1, year), .SDcols = v][, first(.SD), .(ID1,year)]
    

    Output:

          ID1  year    ID2  var1  var2  var3
       <char> <num> <char> <num> <num> <num>
    1:      A  2010     AA   100     4   4.1
    2:      A  2011     AA   120     5   3.1
    3:      A  2012     AA   150     2   2.7