Search code examples
rdatasetcombinationscombn

How to iteratively perform combinations on larger datasets?


Background - I want to try and exhaustively search a set of all possible combinations of 250 rows taken 10 at a time. In order to iteratively get this, I use the following code

`
## Function definition
gen.next.cbn <- function(cbn, n){
  ## Generates the combination that follows the one provided as input
  cbn.bin      <- rep(0, n)
  cbn.bin[cbn] <- 1
  if (tail(cbn.bin, 1) == 0){
    ind <- tail(which(cbn.bin == 1), 1)
    cbn.bin[c(ind, ind+1)] <- c(0, 1)
  }else{
    ind <- 1 + tail(which(diff(cbn.bin) == -1), 1)
    nb  <- sum(cbn.bin[-c(1:ind)] == 1)
    cbn.bin[c(ind-1, (n-nb+1):n)] <- 0
    cbn.bin[ind:(ind+nb)]         <- 1
  }
  cbn <- which(cbn.bin == 1)
}

## Example parameters
n   <- 40
k   <- 10

## Iteration example
for (i in 1:choose(n, k)){
  if (i == 1){
    cbn <- 1:k
  }else{
    cbn <- gen.next.cbn(cbn, n)

  }
  print(cbn)


}


`

I get the error "cannot allocate vector of size n GB" when I go beyond 40 rows.

Ideal Solution: a) If the combinations can be dumped and memory can be flushed iteratively after every run in the loop (where I can check the further conditions) b) If the combinations can be dumped to a csv file such that it does not cause a memory hog.

Thanks for your support.


Solution

  • As I said in the comments, iterpc is the way to go for such a task. You first need to initialize an iterator via the iterpc function. Next we can generate the next n combinations via getnext. After this, we simply append our results to a csv (or any file type you like).

    getComboChunks <- function(n, k, chunkSize, totalCombos, myFile) {
        myIter <- iterpc(n, k)
    
        ## initialized myFile
        myCombs <- getnext(myIter, chunkSize)
        write.table(myCombs, file = myFile, sep = ",", col.names = FALSE)
    
        maxIteration <- (totalCombos - chunkSize) %/% chunkSize
    
        for (i in 1:maxIteration) {
            ## get the next "chunkSize" of combinations
            myCombs <- getnext(myIter, chunkSize)
    
            ## append the above combinations to your file
            write.table(myCombs, file = myFile, sep = ",",
                        col.names = FALSE , append = TRUE)
        }
    }
    

    For example, getComboChunks(250, 10, 100, 1000, "myCombos.csv") will write out 1000 combinations of 250 choose 10 to the file myCombos.csv 100 combinations at a time. Doing this in chunks will be more efficient than one at a time.

    This library is written in C/C++ so it should be fairly efficient, but as @Florian points out in the comments, it won't produce all gmp::chooseZ(250, 10) = Big Integer ('bigz') : [1] 219005316087032475 combinations any time soon. I haven't tested it, but if you settle for 200 choose 5, I think you will be able to produce it in under a day (it is just over 2.5 billion results).