Search code examples
rdata.tableouter-join

Obtain one to one matches from many-to-many linkages, using data.table R


I am trying to link two sets of anonymous ID (ID1 and ID3), via a lookup to another ID (ID2). I want to keep only one-to-one matches between ID1 and ID3, but this is tricky as there are many-to-many matches between each ID.

library(data.table)
Table1 <- data.table(ID1 = c(1, 1, 2, 3, 4, 5, 5, 6),
                 ID2 = c(101, 102, 102, 103, 104, 105, 106, 107))

Table2 <- data.table(ID2 = c(101, 102, 103, 103, 104, 105, 106, 108),
                 ID3 = c(201, 202, 203, 204, 205, 206, 206, 207))

I have tried joining the tables. This is successful in getting rid of cases where ID2 only occurs in one of the tables.

setkey(Table1,ID2)
setkey(Table2,ID2)
merged_table <- Table2[Table1,nomatch=0]

However, it still contains cases where ID1 is linking to multiple ID3 values, and vice versa.

I want to end up with a table like this:

ID1    ID3
  4    205
  5    206

Ideally I would like to do this using data.table in R.


Solution

  • Vectorize the filtering on the joined table:

    unique(
      Table2[
        Table1, .(ID1, ID3), on = "ID2", nomatch=0,
        allow.cartesian=TRUE
      ]
    )[!(duplicated(ID1) | duplicated(ID1, fromLast = TRUE) |
          duplicated(ID3) | duplicated(ID3, fromLast = TRUE))]
    #>    ID1 ID3
    #> 1:   4 205
    #> 2:   5 206
    

    Timings

    Implement a few of the answers so far as functions:

    f1 <- function(Table1, Table2) {
      unique(
        Table2[
          Table1, .(ID1, ID3), on = "ID2", nomatch=0,
          allow.cartesian=TRUE
        ]
      )[!(duplicated(ID1) | duplicated(ID1, fromLast = TRUE) |
            duplicated(ID3) | duplicated(ID3, fromLast = TRUE))]
    }
    
    f2 <- function(Table1, Table2) {
      # adapted from rw2
      setkey(Table1, ID2)
      setkey(Table2, ID2)
      combined_lookups <- Table2[Table1, nomatch=0, allow.cartesian=TRUE]
      
      # Count the number of matches in both directions
      combined_lookups[, count1 := uniqueN(ID1), by = ID3]
      combined_lookups[, count2 := uniqueN(ID3), by = ID1]
      
      # Then remove cases where ID's match multiple person_ids, and vice versa. Also remove duplicate rows.
      unique(combined_lookups[count1==1 & count2==1][,.(ID1,ID3)])
    }
    
    f3 <- function(Table1, Table2) {
      # adapted from @s_baldur
      foo <- \(x) x %in% names(which(table(x) == 1L))
      
      Table2[Table1, on = "ID2", nomatch=0, .(ID1, ID3), allow.cartesian=TRUE
      ][, unique(.SD)
      ][foo(ID1) & foo(ID3)] 
    }
    

    Create a large dataset:

    Table1 <- data.table(ID1 = cumsum(runif(1e6)%/%0.5),
                         ID2 = cumsum(c(1e6, runif(999999)%/%0.5)))
    Table2 <- data.table(ID2 = cumsum(c(1e6, runif(999999)%/%0.5)),
                         ID3 = cumsum(c(2e6, runif(999999)%/%0.5)))
    

    Time the functions on the larger dataset:

    system.time(dt1 <- f1(Table1, Table2))
    #>    user  system elapsed 
    #>    0.70    0.01    0.49
    system.time(dt2 <- f2(Table1, Table2))
    #>    user  system elapsed 
    #>   14.89    3.58   18.05
    system.time(dt3 <- f3(Table1, Table2))
    #>    user  system elapsed 
    #>    5.69    0.08    5.36
    

    Check for equivalence:

    identical(dt1, dt2)
    #> [1] TRUE
    identical(dt1, dt3)
    #> [1] TRUE