Search code examples
rdata.tablenacoalesceimputation

batch store non-missing value in a new column in data.table by a reference data.table


I have a dt.2 with columns including a, b, c, x, d and e. Some columns actually have the same meaning and could be used for imputation. For example, a, b and c equal to x while d and e equal to y. These replationships are recorded in another table dt. I want to impute NA value to x and y by associated columns respectively.

In order to do this in batch, I created a 2 dimention list(object result) based on Darren Tsai's answer here Turn a datatable into a two-dimensional list in R

Then, I intended to use a for loop but kept getting error. Not sure if my direction was correct. Or if I overcomplicated this problem. Thanks!

library(data.table)
# data.table of reference
dt <- data.table(col1 = c("a", "b", "c", "d", "e"),
                 col2 = c("x", "x", "x", "y", "y"))

# data.table to work with
dt.2 <- data.table(a = c(1,  NA, 4),
                   b = c(NA, NA, 2),
                   c = c(NA, NA, NA),
                   x = c(NA, 3,  NA),
                   d = c(5,  6,  NA),
                   e = c(NA, NA, 7))

# my current code
lst1 <- split(dt$col1, dt$col2)
lst2 <- as.list(names(lst1))
result <- list(unname(lst1), lst2)

for(i in 1:2){
  new.col <- result[[2]][i][[1]]
  old.cols <- unique(result[[1]][i][[1]])

  dt.2[, new.col := fcoalesce(old.cols)]
}

Solution

  • Assuming that what you need is for x to be updated and for y to be created based on the existing values of the other column names.

    First, c is logical when it needs to be as.numeric. A literal fix is merely

    dt.2[, c := as.numeric(c)]
    

    Try this:

    spl <- split(dt$col1, dt$col2)
    spl <- Map(c, names(spl), spl)
    spl
    # $x
    # [1] "x" "a" "b" "c"
    # $y
    # [1] "y" "d" "e"
    
    dt.2[, names(spl) := lapply(spl, function(cols) do.call(fcoalesce, .SD[, intersect(cols, names(.SD)), with = FALSE]))]
    dt.2
    #        a     b     c     x     d     e     y
    #    <num> <num> <num> <num> <num> <num> <num>
    # 1:     1    NA    NA     1     5    NA     5
    # 2:    NA    NA    NA     3     6    NA     6
    # 3:     4     2    NA     4    NA     7     7