Search code examples
rjoindata.table

Understanding data.table left join


I read in the data.table documentation https://cran.r-project.org/web/packages/data.table/data.table.pdf the following:

X[DT, on="x"] # left join

And, using this example:

A <- data.table("a"=1:4, "b"=12:15)
B <- data.table("a"=2:3, "b"=13:14)
B[A, on='a']

I get:

   a  b i.b
1: 1 NA  12
2: 2 13  13
3: 3 14  14
4: 4 NA  15

I would like to understand what is happening under the hood here. What is the order of things? In my head we are working inside the scope of B in the join above, and what is happening is that we are for every row in A checking if there is a match in B on the on criteria. If there is a match, or multiple matches, this/these index/indeces are "stored", and we move on to the next row of B and so on. But I just don't understand what happens when there is no match. Why would we "bring" a row from A (first and last row) when we are operating within the scope or context of B? Is a join like this more than just a matching of whatever is in on in order to return some index/indeces which then are applied to B?

UPDATE: We need to understand what is going on when doing these operations. To me, not understanding what is going on behind the scenes of these operations is highly problematic. The just use merge.data.table instead just goes to show what big problem this is. There has supposed to be a vignette about joins, but for years nothing has happened (source: https://github.com/Rdatatable/data.table/issues/2181). For example, I have never seen anybody actually explain what is going on in this example:

A <- data.table("a"=1:4, "b"=12:15)
B <- data.table("a"=c(2:3,6), "b"=13:15)
# A left join B
B[A, on='a']
# and this is also A left join B, where last match duplicates are kept
A[B, on = 'a', bb := i.b]

What, on, earth, is, happening?


Solution

  • This answer here by @MichaelChirico is actually trying to explain what is going on under the hood: data.table join is hard to understand

    Using:

    library(data.table)
    DT = data.table(x = rep(c("b", "a", "c"), each = 3),
                    y = c(1, 3, 6),
                    v = 1:9)
    X = data.table(x = c("c", "b"),
                   v = 8:7,
                   foo = c(4, 2))
    

    He says:

    Note also that we are using the right table to "look up" rows of the left table (in DT[X, on = .(x, y <= foo)]). That means we go through the rows of X and see which rows of DT match.

    This is where me saying we are working inside the scope of the left table comes from. But it doesn't explain what happens when there is no index found. Thankfully he explained that in the comment:

    That case is governed by the nomatch= parameter. it will "fill out" unmatched rows by putting missing for all the DT columns. that's like x=1:10; x[11]: an unmatched index is requested, NA is returned.

    So, when there is no match, missing values for the DT columns are returned, and the non-missing values for the X columns are kept.

    But what about A[B, on = 'a', bb := i.b], what happens here? The same logic as before, now we decide to update A by reference using :=. If there are duplicate indeces returned (because there are multiple matches in on), := chooses the last matched index. If there is no-match for an index i.b will just set to NA. So, this is not really A left join B, because there can't be more rows in the result than the amount of rows from A.