Search code examples
rdata.tablememory-efficientright-join

R data.table update join by reference the, but updating the RIGHT table


Context: The last section of the amazing new Joins in data.table vignette explains how to use the merge sintax ( x[i,on=.(id),...]) to update the LEFT table by reference (also described in this SO question). Here is a simplified example:

x = data.table(id = 1:5, newvar1=LETTERS[1:5])
               #In practice x would have more vars: newvar2, ..., newvarN
 i = data.table(id = 1:7, var1 = c('bla','ble','bli','blo','blu','blA','blS') )

#Updating the LEFT table by reference
x[i,on = .(id),
  j = `:=`(id=id,var1=var1,newvar1=newvar1)]

#Result: the column var1 (from i) is added to x by reference
print(x)
> print(x)
id newvar1   var1
<int> <char> <char>
1:     1      A    bla
2:     2      B    ble
3:     3      C    bli
4:     4      D    blo
5:     5      E    blu

I need to do the opposite: update i by reference.

x = data.table(id = 1:5, newvar1=LETTERS[1:5])
               #In practice x would have more vars: newvar2, ..., newvarN 
i = data.table(id = 1:7,var1 = c('bla','ble','bli','blo','blu','blA','blS'))

#Right join then overwrite the content of i
i <- x[i,on = .(id)]

Is there a way to do that by reference?

In my application (and I assume for several DT heavy users), i is the main dataset, with millions of observations and dozens of columns. That is, the dataset I want to keep adding new variables to, such as newvar1,newvar2,...,newvarN coming from x. In doing that, I need to preserve the columns and rows (cardinality) of i, not x.

Overwriting i, AFAIK, involves making a copy of i, which causes a memory spikes (~ doubles) which may exceed the available ram and crash R.

My current workaround is this:

i_id <- i[,.(id)]     # create an auxiliary data.table with just the 'id' var
i_id <- x[i_id,on=id] #the still making the copy, but with just the 'id' columns + columns from d (instead of hundreds of columns from i) 
i[,newvar1:=i_id$newvar1]
i[,newvar2:=i_id$newvar2]
...
i[,newvarN:=i_id$newvarN]   #This works, but the code is quite large (if N is large), verbose and error prone

Update1: I added a related issue on data.table's github


Solution

  • Would this work?

    cols <- c("newvar1", "newvar2") # or setdiff(names(x), "id")
    i[, (cols) := x[.SD, on = "id", .SD, .SDcols = cols]]
    # or to avoid the double .SD.
    # i[, (cols) := x[i, on = "id", .SD, .SDcols = cols]]
    # i[, (cols) := x[.SD, on = "id", mget(cols)]] 
    
    i
    #       id   var1 newvar1 newvar2
    #    <int> <char>  <char>  <char>
    # 1:     1    bla       A       F
    # 2:     2    ble       B       G
    # 3:     3    bli       C       H
    # 4:     4    blo       D       I
    # 5:     5    blu       E       J
    # 6:     6    blA    <NA>    <NA>
    # 7:     7    blS    <NA>    <NA>
    

    Using

    x <- data.table(id = 1:5, newvar1 = LETTERS[1:5], newvar2 = LETTERS[6:10])