Tags: r, performance, data.table

Speed up setdiff(), intersect(), union() operations on vectors in R


I have a data table dt1:

id1  id2  V1             V2
1    a    c(1, 2, 3, 4)  c(1, 3, 6)
2    b    c(2, 6, 9, 8)  c(8, 5)
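
For reproducibility, dt1 can be built like this (a minimal construction; V1 and V2 are stored as list columns holding the vectors shown above):

library(data.table)

dt1 <- data.table(
  id1 = c(1, 2),
  id2 = c("a", "b"),
  V1  = list(c(1, 2, 3, 4), c(2, 6, 9, 8)),
  V2  = list(c(1, 3, 6), c(8, 5))
)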

I want to add new columns which are the result of the setdiff(), intersect() and union() operations on the V1 and V2 variables.

Expected output:

id1  id2  V1             V2          diff_V1_V2  intersect_V1_V2  union_V1_V2
1    a    c(1, 2, 3, 4)  c(1, 3, 6)  c(2, 4)     c(1, 3)          c(1, 2, 3, 4, 6)
2    b    c(2, 6, 9, 8)  c(8, 5)     c(2, 6, 9)  c(8)             c(2, 5, 6, 8, 9)

I tried:

dt_new <- dt1[, c("diff_V1_V2", "intersect_V1_V2", "union_V1_V2") := list(
              Map(setdiff, V1, V2),
              Map(intersect, V1, V2),
              Map(union, V1, V2))]

But my real vectors are very long, so these operations take a very long time.

So, how can these operations be sped up, or how can I get similar results using other functions/approaches? I'm looking for the most efficient way.
Or is it possible to parallelize the calculations?


Solution

  • Naive approach:

    Naive <- function (a, b) {
      list(intersect = intersect(a, b),
           union = union(a, b),
           adiffb = setdiff(a, b))
    }
    

    You can exploit basic mathematics to complete all three operations in one scan of vectors a and b, instead of three scans by three function calls. In addition, you can skip the expensive unique if you know for sure that there are no duplicated values in either vector.

    SetOp <- function (a, b, no.dup.guaranteed = FALSE) {
      ## skip the expensive unique() when the caller guarantees there are no duplicates
      if (no.dup.guaranteed) {
        au <- a
        bu <- b
      } else {
        au <- unique(a)
        bu <- unique(b)
      }
      ## one scan: position of each element of bu in au (0 where there is no match)
      ind <- match(bu, au, nomatch = 0)
      INTERSECT <- au[ind]
      ## drop the matched positions from au; the extra index length(au) + 1 guards
      ## against ind being all zeros (see https://stackoverflow.com/a/52772380)
      DIFF <- au[-c(ind, length(au) + 1)]
      UNION <- c(bu, DIFF)
      list(intersect = INTERSECT, union = UNION, adiffb = DIFF)
    }
    
    SetOp(a = c(1, 2, 3, 4), b = c(1, 3, 6))
    ## $intersect: 1 3    $union: 1 3 6 2 4    $adiffb: 2 4
    SetOp(a = c(2, 6, 9, 8), b = c(8, 5))
    ## $intersect: 8      $union: 8 5 2 6 9    $adiffb: 2 6 9
    

    Some benchmarks:

    ## no duplicated values in either a or b; can set no.dup.guaranteed = TRUE
    a <- sample.int(11000, size = 10000, replace = FALSE)
    b <- sample.int(11000, size = 10000, replace = FALSE)
    
    microbenchmark::microbenchmark(naive = Naive(a, b),
                                   better = SetOp(a, b),
                                   fly = SetOp(a, b, no.dup.guaranteed = TRUE))
    #Unit: milliseconds
    #   expr      min       lq     mean   median       uq      max neval
    #  naive 6.457302 6.489710 6.751996 6.511399 6.567941 8.571623   100
    # better 3.251701 3.268873 3.377910 3.277086 3.306723 3.880755   100
    #    fly 1.734898 1.749300 1.805163 1.755927 1.767114 3.326179   100
    
    ## lots of duplicated values in both a and b; must have no.dup.guaranteed = FALSE
    a <- sample.int(100, size = 10000, replace = TRUE)
    b <- sample.int(100, size = 10000, replace = TRUE)
    
    microbenchmark::microbenchmark(naive = Naive(a, b),
                                   better = SetOp(a, b))
    #Unit: microseconds
    #   expr      min       lq      mean   median       uq      max neval
    #  naive 1421.702 1431.023 1653.1339 1443.147 1483.255 3809.031   100
    # better  396.193  398.695  446.7062  400.293  412.046 1995.294   100
    

    If you want further speedup, you need to think about how to speed up unique(). This is not easy, because you probably can't beat the internal algorithms used by R.


    I see an additional speed improvement from replacing unique() with Rfast::sort_unique().

    Thank you, @M.Viking. It is nice to see your answer. I have not had time to install the GNU Scientific Library (GSL) on my operating system, so I haven't been able to install and try Rfast myself.

    Here are some comments on your benchmark result.

    • The reason “better2” is faster than “fly” is that match is faster on sorted vectors. So yes, even if there are no duplicated values in a and b, it is still a good idea to apply sort_unique.

    • You may also want to try the Match function from Rfast. I spotted this function in the package's documentation, but I don't know how fast it is compared with R's base version. Also, the documentation does not state clearly how Match handles non-matches. By contrast, base match has a useful nomatch argument, which I set to 0 to avoid NA indexing.

    Good ideas. Unfortunately, Rfast::Match() is not a drop-in replacement for base::match(). Luckily, however, fastmatch::fmatch() is a fast drop-in replacement for match().

    We are having a really inspiring iteration here! That is good to know! Amazing R community with so many useful tools!


    My V1 and V2 variables do not contain duplicates, so there is no need to use the unique function, if I understand correctly?

    Intuitively, yes, because we don't want to do extra work. But interestingly, the benchmark result in M.Viking's answer shows that it is worth sorting the vectors to speed up match. So you may actually want to use the SetOp2() given by M.Viking.

    I don't think SetOp3(), with base::match replaced by fastmatch::fmatch, is worthwhile in your application. According to its documentation, fmatch is faster than match only when we perform repeated matching like match(a, key), match(b, key), etc., where the key is reused. The benchmark by M.Viking favors this, because microbenchmark() repeats SetOp3(a, b) 100 times. In the first run, fmatch is as fast as match; it is then much faster than match in the following 99 runs. In your application, however, each vector is used only once. Since there is no reuse, we are fine staying with match.
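
    For reference, below is a rough sketch of what the two variants discussed above look like; M.Viking's actual SetOp2()/SetOp3() live in his/her answer and may differ in detail. SetOp2 swaps unique() for Rfast::sort_unique(), and SetOp3 additionally swaps base::match() for fastmatch::fmatch().

    ## SetOp2 (sketch): sorted-unique variant, assuming the Rfast package is installed
    SetOp2 <- function (a, b) {
      au <- Rfast::sort_unique(a)  ## sorted unique values; match() is faster on sorted input
      bu <- Rfast::sort_unique(b)
      ind <- match(bu, au, nomatch = 0)
      INTERSECT <- au[ind]
      DIFF <- au[-c(ind, length(au) + 1)]
      UNION <- c(bu, DIFF)
      list(intersect = INTERSECT, union = UNION, adiffb = DIFF)
    }

    ## SetOp3 (sketch): as SetOp2, but with fastmatch::fmatch(), a drop-in
    ## replacement for match() that caches a hash table on the lookup table
    SetOp3 <- function (a, b) {
      au <- Rfast::sort_unique(a)
      bu <- Rfast::sort_unique(b)
      ind <- fastmatch::fmatch(bu, au, nomatch = 0)
      INTERSECT <- au[ind]
      DIFF <- au[-c(ind, length(au) + 1)]
      UNION <- c(bu, DIFF)
      list(intersect = INTERSECT, union = UNION, adiffb = DIFF)
    }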

    So, how can I apply your solution to each row of my data (V1 and V2 are lists there)?

    We have to use a loop or loop-like function, like the Map you used in your question. The only issue is that we need some post-processing to extract the results. See below.

    V1 <- list(a1 = c(1, 2, 3, 4), a2 = c(2, 6, 9, 8))
    V2 <- list(b1 = c(1, 3, 6), b2 = c(8, 5))
    
    ## or: ans <- Map(SetOp2, V1, V2)
    ans <- Map(SetOp, V1, V2, no.dup.guaranteed = TRUE)
    ## post-processing
    INTERSECT <- lapply(ans, "[[", 1)
    UNION <- lapply(ans, "[[", 2)
    SETDIFF <- lapply(ans, "[[", 3)
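
    To attach the results directly as new columns of your data table (a sketch assuming dt1 is the data.table from the question, with list columns V1 and V2; SetOp2 can be substituted for SetOp in the same way):

    ans <- Map(SetOp, dt1$V1, dt1$V2, no.dup.guaranteed = TRUE)
    dt1[, `:=`(
      diff_V1_V2      = lapply(ans, `[[`, "adiffb"),
      intersect_V1_V2 = lapply(ans, `[[`, "intersect"),
      union_V1_V2     = lapply(ans, `[[`, "union")
    )]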
    

    Additional thoughts on parallel processing

    jblood94's answer takes a further step by parallel computing. Good job! It is a great exercise for practicing parallel. Yet in principle, we don't want to parallel this task, because it is memory-bound rather than CPU-bound. We are just scanning data from memory without doing sophisticated CPU arithmetic. It is known that parallel processing is not promising for this kind of job and I don't expect a big speedup. It seems that jblood94 is able to get 82.89 / 34.21 = 2.42 speedup against serial processing. However, he/she does not mention how many CPU cores are used. For example, if 8 cores are exploited, then 2.42 speedup is pretty poor.