
Optimizing calculation of combinations into list - large data set


I wonder if someone can figure out a faster way to calculate combinations of elements in a vector. My approach works, but it is slow when the vector contains about 6 million elements.

Test vector

test.vector <- c("335261 344015 537633","22404 132858","254654 355860 488288","219943 373817","331839 404477")

My approach

lapply(strsplit(test.vector, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))

Expected output

[[1]]
[1] "335261344015" "335261537633" "344015537633"

[[2]]
[1] "22404132858"

[[3]]
[1] "254654355860" "254654488288" "355860488288"

[[4]]
[1] "219943373817"

[[5]]
[1] "331839404477"

Solution

  • Here is an answer that is much faster than the OP’s solution on large test cases. Instead of relying on paste, it takes advantage of properties of numbers and vectorized operations. We also use comboGeneral from the RcppAlgos package (I am the author), which is much faster than combn and combn_prim from the linked answer for generating combinations of a vector. First, we show the efficiency gains of comboGeneral over the other functions:

    ## library(gRbase)
    library(RcppAlgos)
    library(microbenchmark)
    options(digits = 4)
    options(width = 90)
    options(scipen = 99)
    
    microbenchmark(
        gRBase = gRbase::combn_prim(300, 2),
        utils  = combn(300, 2),
        RcppAlgos = comboGeneral(300, 2),
        unit = "relative"
    )
    #> Warning in microbenchmark(gRBase = gRbase::combn_prim(300, 2), utils = combn(300, : less
    #> accurate nanosecond times to avoid potential integer overflows
    #> Unit: relative
    #>       expr   min     lq  mean median     uq     max neval
    #>     gRBase 104.6  93.79 122.1  91.82  88.26 1259.00   100
    #>      utils 141.8 127.72 126.2 125.70 128.44   88.04   100
    #>  RcppAlgos   1.0   1.00   1.0   1.00   1.00    1.00   100
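
    One structural detail worth noting before moving on: comboGeneral returns its combinations as the rows of a matrix, whereas combn returns them as columns. The small check below is purely illustrative (not part of the benchmark) and explains why the function defined further down indexes the result column-wise (combs[, 1] and combs[, 2]):

    ## illustrative only: compare output shapes of the two generators
    dim(RcppAlgos::comboGeneral(4, 2))  # 6 x 2 -- one combination per row
    dim(combn(4, 2))                    # 2 x 6 -- one combination per column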
    

    Now, we create a function to create some random reproducible data that will be passed to our test functions:

    makeTestSet <- function(vectorSize, elementSize, mySeed = 42, withRep = FALSE) {
        set.seed(mySeed)
        sapply(1:vectorSize, function(x) {
            ## each element is a space-separated string of 2 to elementSize random IDs below 10^6
            numIds <- sample(2:elementSize, 1)
            paste(sample(10^6, numIds, replace = withRep), collapse = " ")
        })
    }
    
    makeTestSet(5, 3)
    #> [1] "280897 566346"        "74362 46208 964632"   "226905 326254"       
    #> [4] "626546 926595 586329" "222181 946797"
    

    That looks good. Now, let's see if setting fixed = TRUE gets us any gains (as suggested above by @MichaelChirico):

    bigVec <- makeTestSet(10, 100000)
    
    microbenchmark(standard = strsplit(bigVec, " "),
                   withFixed = strsplit(bigVec, " ", fixed = TRUE),
                   times = 15, unit = "relative")
    #> Unit: relative
    #>       expr  min    lq  mean median    uq   max neval
    #>   standard 6.18 6.134 5.197  6.048 5.678 2.921    15
    #>  withFixed 1.00 1.000 1.000  1.000 1.000 1.000    15
    

    @MichaelChirico was spot on: skipping the regular-expression engine with fixed = TRUE pays off on long strings. Putting it all together we get:

    combPairFast <- function(testVec) {
        lapply(strsplit(testVec, " ", fixed = TRUE), function(x) {
            combs <- RcppAlgos::comboGeneral(as.numeric(x), 2)
            ## shift the first number left by the digit count of the second, then add;
            ## numerically equivalent to paste0(combs[, 1], combs[, 2])
            unique(combs[, 1] * 10^(as.integer(log10(combs[, 2])) + 1L) + combs[, 2])
        })
    }
    
    ## test.vector defined above by OP
    test.vector <- c(
        "335261 344015 537633", "22404 132858",
        "254654 355860 488288", "219943 373817",
        "331839 404477"
    )
    
    combPairFast(test.vector)
    #> [[1]]
    #> [1] 335261344015 335261537633 344015537633
    #> 
    #> [[2]]
    #> [1] 22404132858
    #> 
    #> [[3]]
    #> [1] 254654355860 254654488288 355860488288
    #> 
    #> [[4]]
    #> [1] 219943373817
    #> 
    #> [[5]]
    #> [1] 331839404477
    
    
    ## OP original code
    combPairOP <- function(testVec) {
        lapply(strsplit(testVec, " "), function(x) unique(
            apply(combn(x, 2), 2, function(y) paste0(y, collapse = "")))
        )
    }
    

    As stated in the comments by the OP, the maximum number is less than a million (600000 to be exact). That means multiplying one of the numbers by at most 10^6 and adding another six-digit number to it (which is equivalent to concatenating the two number strings) is guaranteed to stay within the integer precision of a double in base R (i.e. 2^53 - 1). This matters because arithmetic on numeric values is much more efficient than string operations.
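
    To make the digit-shift concrete, here is a minimal sanity check (illustrative only, not part of the benchmark) showing that the arithmetic reproduces the pasted string while remaining exactly representable:

    x <- 335261
    y <- 344015
    shifted <- x * 10^(as.integer(log10(y)) + 1L) + y   # shift x left by the 6 digits of y, add y
    shifted                                              # 335261344015
    identical(format(shifted, scientific = FALSE), paste0(x, y))  # TRUE
    shifted < 2^53 - 1                                   # TRUE, so the double representation is exact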

    All that is left is to benchmark:

    test.vector <- makeTestSet(100, 50)
    
    microbenchmark(
        original = combPairOP(test.vector),
        new_impl = combPairFast(test.vector),
        times = 20,
        unit = "relative"
    )
    #> Unit: relative
    #>      expr   min    lq  mean median    uq  max neval
    #>  original 31.68 31.55 23.85  30.46 29.14 11.2    20
    #>  new_impl  1.00  1.00  1.00   1.00  1.00  1.0    20
    

    And on larger vectors:

    bigTest.vector <- makeTestSet(1000, 100, mySeed = 22, withRep = TRUE)
    
    ## Duplicate values exist
    any(sapply(strsplit(bigTest.vector, " ", fixed = TRUE), function(x) {
        any(duplicated(x))
    }))
    #> [1] TRUE
    
    
    system.time(t1 <- combPairFast(bigTest.vector))
    #>    user  system elapsed 
    #>   0.081   0.004   0.086
    
    
    system.time(t2 <- combPairOP(bigTest.vector))
    #>    user  system elapsed 
    #>   4.519   0.040   4.560
    
    
    ## results are the same
    all.equal(t1, lapply(t2, as.numeric))
    #> [1] TRUE
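
    If character IDs are ultimately needed (as in the OP's expected output) rather than numeric values, one option is to convert back to strings only once at the very end. The wrapper below is just a sketch, and combPairFastChr is a hypothetical name rather than part of the original answer:

    ## hypothetical wrapper: fast numeric concatenation, converted to character at the end
    combPairFastChr <- function(testVec) {
        lapply(combPairFast(testVec), function(v) format(v, scientific = FALSE, trim = TRUE))
    }

    combPairFastChr("335261 344015 537633")
    ## expected: a one-element list containing c("335261344015", "335261537633", "344015537633")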