Search code examples
rdata.tablesemi-join

Perform a semi-join with data.table


How do I perform a semi-join with data.table? A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y. For example, the following code performs an inner join:

x <- data.table(x = 1:2, y = c("a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)

x[y]
#   x y  z
# 1: 1 a 10
# 2: 1 a 11

A semi-join would return just x[1]


Solution

  • More possibilities :

    w = unique(x[y,which=TRUE])  # the row numbers in x which have a match from y
    x[w]
    

    If there are duplicate key values in x, then that needs :

    w = unique(x[y,which=TRUE,allow.cartesian=TRUE])
    x[w]
    

    Or, the other way around :

    setkey(y,x)
    w = !is.na(y[x,which=TRUE,mult="first"])
    x[w]
    

    If nrow(x) << nrow(y) then the y[x] approach should be faster.
    If nrow(x) >> nrow(y) then the x[y] approach should be faster.

    But the anti anti join appeals too :-)