Inconsistent merge behavior with three data.tables


I've been searching this morning to figure out whether the failure below is expected, but haven't found anything. Could anyone point me to a related discussion? Otherwise, I might submit it as an issue. Appreciate it.

library(data.table)

x <- data.table( a = 1:3 )
y <- data.table( a = 2:4 )
z <- data.table( a = 3:5 )

# works
merge( x , y )
# works
merge( y , z )

# fails
merge( x , merge( y , z ) )
# Error in merge.data.table(x, merge(y, z)) :
#   A non-empty vector of column names for `by` is required.

# works
merge( merge( x , y ) , z )

Solution

  • This is a clear bug. Please report it. Luckily, it should be easy to fix.

    merge.data.table contains this code:

    if (is.null(by)) 
      by = intersect(key(x), key(y))
    if (is.null(by)) 
      by = key(x)
    if (is.null(by)) 
      by = intersect(names(x), names(y))
    

    Now, the issue is that y is keyed (because merge.data.table sets a key):

    x <- data.table( a = 1:3 )
    y <- merge(data.table( a = 2:4 ), data.table( a = 3:5 ))
    haskey(y)
    #[1] TRUE
    

    Then,

    intersect(key(x), key(y))
    #character(0)
    

    Thus by is character(0), which is empty but not NULL, so neither of the two following if conditions is TRUE (we would want the third one to apply here).

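    This doesn't happen in your last case, where the first table is keyed and the second is not, because intersect handles NULL differently depending on which argument it is: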

    intersect("foo", NULL)
    #NULL
    intersect(NULL, "foo")
    #character(0)
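
To make the failing check concrete, here is a minimal sketch in base R that uses plain character vectors in place of real keys. resolve_by is a hypothetical helper written only for illustration; it is not data.table's actual code. It mirrors the fallback order quoted above, but treats a zero-length by the same as NULL, which is one possible shape for the fix.

```r
# character(0) is empty but not NULL, so is.null() does not catch it
by <- character(0)
is.null(by)        # FALSE
length(by) == 0L   # TRUE

# Hypothetical helper (not part of data.table): same fallback order as the
# quoted snippet, but each step fires when `by` is NULL *or* zero-length.
resolve_by <- function(key_x, key_y, names_x, names_y) {
  by <- intersect(key_x, key_y)
  if (length(by) == 0L) by <- key_x
  if (length(by) == 0L) by <- intersect(names_x, names_y)
  by
}

# Failing case from the question: x is unkeyed, y is keyed on "a"
resolve_by(NULL, "a", "a", "a")   # "a"

# Working case from the question: x is keyed on "a", y is unkeyed
resolve_by("a", NULL, "a", "a")   # "a"
```

Until the bug is fixed, passing the join column explicitly, as in merge( x , merge( y , z ) , by = "a" ), should sidestep the default resolution entirely, since by is then neither NULL nor empty.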