r vector compare set-difference

All-to-all setdiff on two numeric vectors with a numeric threshold for accepting matches


What I want to do is more or less a combination of the problems discussed in the two following threads:

I have two numeric vectors:

b_1 <- c(543.4591, 489.36325, 12.03, 896.158, 1002.5698, 301.569)
b_2 <- c(22.12, 53, 12.02, 543.4891, 5666.31, 100.1, 896.131, 489.37)

I want to compare all elements in b_1 against all elements in b_2 and vice versa.

If element_i in b_1 does NOT fall within the range element_j ± 0.045 for any element_j in b_2, then element_i must be reported.

Likewise, if element_j in b_2 does NOT fall within the range element_i ± 0.045 for any element_i in b_1, then element_j must be reported.

Therefore, the expected answer for the vectors provided above is:

### based on threshold = 0.045
in_b1_not_in_b2 <- c(1002.5698, 301.569)
in_b2_not_in_b1 <- c(22.12, 53, 5666.31, 100.1)

Is there an R function that would do this?


Solution

  • If you are happy to use a non-base package, data.table::inrange is a convenient function.

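    library(data.table)

    # keep the elements of b_1 that are not within 0.045 of any element of b_2
    b_1[!inrange(b_1, b_2 - 0.045, b_2 + 0.045)]
    # [1] 1002.570  301.569
    
    # and the other way around
    b_2[!inrange(b_2, b_1 - 0.045, b_1 + 0.045)]
    # [1]   22.12   53.00 5666.31  100.10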
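
    To make the logic explicit, the same all-to-all comparison can be sketched in base R with outer(). This is only an illustration, not one of the benchmarked alternatives: the helper name tol_setdiff is hypothetical, and because it builds a length(a) x length(b) matrix it is only reasonable for small to moderate vectors.

    ```r
    # Hypothetical helper (not from any package): tolerance-based set difference.
    tol_setdiff <- function(a, b, tol) {
      # close[i, j] is TRUE when a[i] lies within tol of b[j]
      close <- abs(outer(a, b, "-")) <= tol
      list(in_a_not_in_b = a[rowSums(close) == 0],   # rows with no match in b
           in_b_not_in_a = b[colSums(close) == 0])   # columns with no match in a
    }

    b_1 <- c(543.4591, 489.36325, 12.03, 896.158, 1002.5698, 301.569)
    b_2 <- c(22.12, 53, 12.02, 543.4891, 5666.31, 100.1, 896.131, 489.37)

    res <- tol_setdiff(b_1, b_2, 0.045)
    res
    ```

    On the vectors from the question this returns the two expected sets (1002.5698 and 301.569 for b_1; 22.12, 53, 5666.31 and 100.1 for b_2).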
    

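    inrange is also efficient on larger data sets. On vectors of length 1e5, for example, it was more than 700 times faster (by median run time) than the two other alternatives, labelled f1 (the function f) and f2 (an sapply-based approach) below: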

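    library(data.table)
    library(microbenchmark)
    
    # Note: f() is not defined in this post; it must be available for f1 to run
    n <- 1e5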
    b1 <- runif(n, 0, 10000)
    b2 <- b1 + runif(n, -1, 1)
    
    microbenchmark(
      f1 = f(b1, b2, 0.045, 5000),
      f2 = list(in_b1_not_in_b2 = b1[sapply(b1, function(x) !any(abs(x - b2) <= 0.045))],
           in_b2_not_in_b1 = b2[sapply(b2, function(x) !any(abs(x - b1) <= 0.045))]),
      f3 = list(in_b1_not_in_b2 = b1[!inrange(b1, b2 - 0.045, b2 + 0.045)],
           in_b2_not_in_b1 = b2[!inrange(b2, b1 - 0.045, b1 + 0.045)]),
      unit = "relative", times = 10)
    # Unit: relative
    #  expr      min       lq     mean   median        uq       max neval
    #    f1 1976.931 1481.324 1269.393 1103.567 1173.3017 1060.2435    10
    #    f2 1347.114 1027.682  858.908  766.773  754.7606  700.0702    10
    #    f3    1.000    1.000    1.000    1.000    1.0000    1.0000    10
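
    If adding a package is not an option, a base R sketch that avoids the quadratic all-to-all comparison is to sort one vector and look only at the two nearest neighbours of each element via findInterval(). The helper name not_near is hypothetical, the sketch assumes finite, non-missing inputs, and it has not been benchmarked against the timings above.

    ```r
    # Hypothetical helper: elements of a that are farther than tol from every element of b.
    # Assumes a and b are finite numeric vectors without NA.
    not_near <- function(a, b, tol) {
      # pad with -Inf/Inf so every element of a has a left and a right neighbour
      sb <- c(-Inf, sort(b), Inf)
      i <- findInterval(a, sb)                 # sb[i] <= a < sb[i + 1]
      # distance to the closest element of b is the smaller of the two gaps
      nearest <- pmin(a - sb[i], sb[i + 1] - a)
      a[nearest > tol]
    }

    b_1 <- c(543.4591, 489.36325, 12.03, 896.158, 1002.5698, 301.569)
    b_2 <- c(22.12, 53, 12.02, 543.4891, 5666.31, 100.1, 896.131, 489.37)

    out <- list(in_b1_not_in_b2 = not_near(b_1, b_2, 0.045),
                in_b2_not_in_b1 = not_near(b_2, b_1, 0.045))
    out
    ```

    Sorting costs on the order of n log n and each lookup is a binary search, so this should scale better than the sapply approach, although results exactly at the threshold boundary may differ from inrange because of floating-point rounding.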
    

    And yes, they give the same result:

    n <- 100
    b1 <- runif(n, 0, 10000)
    b2 <- b1 + runif(n, -1, 1)
    
    all.equal(f(b1, b2, 0.045, 5000),
              list(in_b1_not_in_b2 = b1[sapply(b1, function(x) !any(abs(x - b2) <= 0.045))],
                   in_b2_not_in_b1 = b2[sapply(b2, function(x) !any(abs(x - b1) <= 0.045))]))
    # TRUE
    
    all.equal(f(b1, b2, 0.045, 5000),
              list(in_b1_not_in_b2 = b1[!inrange(b1, b2 - 0.045, b2 + 0.045)],
                   in_b2_not_in_b1 = b2[!inrange(b2, b1 - 0.045, b1 + 0.045)]))
    # TRUE
    

    Searching for inrange on Stack Overflow turns up several related, potentially useful answers.