Search code examples
rjoindata.table

Join when the same keys are set but in different order


I am not understanding why the order of keys in data tables being joined makes a difference.

Consider the following (expected) result of joining tmp_dt_1 to tmp_dt_2:

tmp_dt_1 <- 
    data.table(a = c(1, 2), b = c(2, 1), c = 1:2)

tmp_dt_2 <- 
    data.table(a = c(1, 2), b = c(2, 1), d = 1:2)


setkeyv(tmp_dt_1, c("a", "b"))


setkeyv(tmp_dt_2, c("a", "b"))


tmp_dt_1

# Key: <a, b>
#     a     b     c
#     <num> <num> <int>
# 1:     1     2     1
# 2:     2     1     2


tmp_dt_2

# Key: <a, b>
#     a     b     d
#     <num> <num> <int>
# 1:     1     2     1
# 2:     2     1     2


tmp_dt_1[tmp_dt_2]
# Key: <a, b>
#     a     b     c     d
#     <num> <num> <int> <int>
# 1:     1     2     1     1
# 2:     2     1     2     2

Now, key tmp_dt_2 with column b followed by a. This produces a join that is different (and wrong) compared to the above:

setkeyv(tmp_dt_2, c("b", "a"))

tmp_dt_1
# Key: <a, b>
#     a     b     c
#   <num> <num> <int>
# 1:     1     2     1
# 2:     2     1     2


tmp_dt_2

# Key: <b, a>
#     a     b     d
#   <num> <num> <int>
# 1:     2     1     2
# 2:     1     2     1



tmp_dt_1[tmp_dt_2]

# Key: <a, b>
#     a     b     c     d
#     <num> <num> <int> <int>
# 1:     1     2     1     2
# 2:     2     1     2     1

Solution

  • In What are primary and secondary indexes in data.table?, the data.table docs say:

    setkey(DT, col1, col2) orders the rows by column col1 then within each group of col1 it orders by col2. This is a primary index. The row order is changed by reference in RAM... We use the words index and key interchangeably.

    (Emphasis mine)

    The keys determine row order in RAM, which allows for low-level optimisations that take advantage of this order. For example, grouping operations work on data that is contiguous in RAM, which can minimise page fetches and copy memory in bulk.

    This means that order matters. So when you specify the keys in the order that you do in the second case, you are actually doing this:

    tmp_dt_1[
        tmp_dt_2,
        on = .(a = b, b = a)
    ]
    

    We can see this more clearly if we only set one key:

    setkey(tmp_dt_1, a)
    setkey(tmp_dt_2, b)
    all(
        tmp_dt_1[tmp_dt_2] == 
            tmp_dt_1[tmp_dt_2, on = .(a)]
    ) # FALSE
    
    all(
        tmp_dt_1[tmp_dt_2] == 
            tmp_dt_1[tmp_dt_2, on = .(a = b)]
    ) # TRUE
    

    If you're concerned about this, the easiest thing would be to set keys using a vector which is the same for both data.tables e.g.

    keys <- c("a", "b")
    setkeyv(tmp_dt_1, keys)
    setkeyv(tmp_dt_2, keys)
    

    Having said that, you should not generally set a key if all you are going to do is join. There is no performance advantage unless you're going to do repeated operations grouped by the primary key. For more about what happens when you set a key, see What is the purpose of setting a key in data.table? It's worth noting that that answer, written by one of the authors of data.table, says:

    In most cases therefore, there shouldn't be a need to set keys anymore. We recommend using on= wherever possible, unless setting key has a dramatic improvement in performance that you'd like to exploit.