Search code examples
rjoindata.tablecartesian-product

data.table cartesian join warning on legitimate join


I am sorry in advance for the long post which mixes several questions. If not appropriate, please edit or advise what I should do. I am practicing data.table join, here is an imaginary scenario:

"Two robots, each has 4 positions in MovementA and 4 positions in MovementB. Problem to Solve: for each robot, from MoveA to MoveB, there are 4x4 possible Position-pairs Find the 4 pairs with the shortest distance"

Data Setup

library(data.table)

set.seed(20141220)
dtMoveA = data.table(RobotID = rep(1:2, each=4), Position=sample(1:20, 8))
dtMoveB = data.table(RobotID = rep(1:2, each=4), Position=sample(1:20, 8))

# Review Data
rbind(cbind(Movement="Move-A", dtMoveA), cbind(Movement="Move-B", dtMoveB))

    Movement RobotID Position
 1:   Move-A       1       18
 2:   Move-A       1       20
 3:   Move-A       1       15
 4:   Move-A       1        8
 5:   Move-A       2       13
 6:   Move-A       2        2
 7:   Move-A       2        9
 8:   Move-A       2       12
 9:   Move-B       1       18
10:   Move-B       1       14
11:   Move-B       1       13
12:   Move-B       1       17
13:   Move-B       2        5
14:   Move-B       2       16
15:   Move-B       2       20
16:   Move-B       2        3

Solution 1 (using dplyr)

library(dplyr)

dtMoveA %>%
    inner_join(dtMoveB, by="RobotID") %>%
    mutate(AbsDistance = abs(Position.x - Position.y)) %>%
    group_by(RobotID, Position.x) %>%
    filter(AbsDistance == min(AbsDistance)) %>%
    arrange(RobotID, Position.x)

  RobotID Position.x Position.y AbsDistance
1       1          8         13           5
2       1         15         14           1
3       1         18         18           0
4       1         20         18           2
5       2          2          3           1
6       2          9          5           4
7       2         12         16           4
8       2         13         16           3

(attempt) Solution 2 (using data.table)

setkey(dtMoveA, RobotID)
setkey(dtMoveB, RobotID)

dtMoveA[dtMoveB, .(RobotID, Position, i.Position,
                         AbsDistance = abs(Position - i.Position)), allow.cartesian=TRUE
    ] [, MinDistance := min(AbsDistance), by=list(RobotID, Position)
    ] [ AbsDistance == MinDistance, .(Position, i.Position, AbsDistance), by=RobotID
    ] [ order(RobotID, Position)]

   RobotID Position i.Position AbsDistance
1:       1        8         13           5
2:       1       15         14           1
3:       1       18         18           0
4:       1       20         18           2
5:       2        2          3           1
6:       2        9          5           4
7:       2       12         16           4
8:       2       13         16           3

Question 1 Can you please correct my Solution2 with the good practices from data.table art?

Question 2 without the parameter allow.cartesian=TRUE data.table warns "Join results in 32 rows; more than 8 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE"

Is it really a cartesian product? Here the join is made only on the common key values, it is just a coincidence in the data which yields a big join results.

Question 3 dtMoveA and dtMoveB have same column names. datatable join makes the distinction by changing the name to i.Position. Is the "i" prefix something hardcoded? And I suppose i.ColumnName always applies to Y member in the X[Y] join expression.

Thanks in advance for any help.


Solution

  • Question 1:

    Looks great! But the equivalent of inner join is to also add nomatch=0L. Otherwise you'd also get all the rows from dtMoveB. We can't make use of by=.EACHI here AFAICT.

    Read this answer and the answer linked under this comment for understanding the purpose of allow.cartesian = TRUE.

    Question 2: From ?data.table under the entry for allow.cartesian:

    FALSE prevents joins that would result in more than max(nrow(x), nrow(i)) rows. This is usually caused by duplicate values in i's join columns, each of which join to the same group in x over and over again: a misspecified join.

    Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve in data.table: where every row in i joins to every row in x (a nrow(x) * nrow(i) row result). 'cartesian' is just meant in a 'large multiplicative' sense.

    Does this answer your question?

    Question 3:

    Yes. Joins are of the form x[i]. When x and i both share a column name and will be in the joined result, those columns have a i. prefix added to it. It's the same prefix that also allows you to access i's columns in j, where both x and i share the column name in operations of the form x[i, j] or x[i, j, by=.EACHI].

    While joining you can change the names to whatever you like though. Here, you could change to position.x and position.y with:

    dtMoveA[dtMoveB, .(RobotID, Position.x=Position, Position.y=i.Position,
                     AbsDistance = abs(Position - i.Position)), allow.cartesian=TRUE]
    

    HTH


    PS: If you've any suggestions, feel free to add a FR here. Please have a look at the posting guidelines before to do so.