Search code examples
pythoncombinations

Itertools combinations find if a combination is divisible


Given itertools combinations with an r of 4:

from itertools import combinations

mylist = range(0,35) 
r = 4
combinationslist = list(combinations(mylist, r))

Which will output:

(0, 1, 2, 3)
(0, 1, 2, 4)
(0, 1, 2, 5)
(0, 1, 2, 6)
(0, 1, 2, 7)
(0, 1, 2, 8)
(0, 1, 2, 9)
...
(30, 31, 32, 33)
(30, 31, 32, 34)
(30, 31, 33, 34)
(30, 32, 33, 34)
(31, 32, 33, 34)

My question is if we were to chunk to the list into blocks of 10 can we find what nth a combination is within those blocks, but without generating all combinations. Or in another words if the position is divisible by x.

One of the problems with this is the positions will get into the billions of billions and might not be possible to derive what the nth is. Is there a heuristic that can regardless find whether a particular combination/sequence of elements is divisible by x

Edit/addition: The reasoning for this question is for situations where the list is range(0,1000000) and r =30000 for example. Then provided a combination, find if it's divisible by x. Naturally the actual index will be ridiculously enormous (and the full combinations too much to generate)


Solution

  • I have authored a package in R called RcppAlgos that has functions specifically for this task.

    TL;DR

    Use comboRank from RcppAlgos.

    Details

    In the article that @wim linked to, you will see that this procedure is often called ranking and as many have pointed out, this boils down to counting.

    In the RcppAlgos package there are several ranking functions for ranking various structures (e.g. partitionsRank for ranking integer partitions). We will use comboRank for the task at hand:

    library(RcppAlgos)
    
    ## Generate random combination from 1:35 of length 4
    set.seed(42)
    small_random_comb = sort(sample(35, 4))
    
    ## Print the combination
    small_random_comb
    #> [1]  1  4 10 25
    
    ## Ranking the combination with comboRank (See ?comboRank for more info).
    ## N.B. for the ranking functions, we must provide the source vector to rank appropriately
    idx = comboRank(small_random_comb, v = 35)
    
    ##  Remember, R is base 1.
    idx
    #> [1] 1179
    
    ## Generate all combinations to confirm
    all_combs = comboGeneral(35, 4)
    
    ## Same result
    all_combs[idx, ]
    #> [1]  1  4 10 25
    

    Efficiency

    The functions are very efficient as well. They are written in C++ and use the gmp library for handling large numbers.

    Are they efficient enough for the very large case n = 1000000 and r = 10000 (or even r = 30000)?

    set.seed(97)
    large_random_comb = sort(sample(1e6, 1e4))
    
    head(large_random_comb)
    #> [1]  76 104 173 608 661 828
    
    tail(large_random_comb)
    #> [1] 999684 999731 999732 999759 999824 999878
    
    system.time(lrg_idx <- comboRank(large_random_comb, v = 1e6))
    #>   user  system elapsed 
    #>  2.036   0.003   2.039
    
    ## Let’s not print this number as it is over 20,000 digits
    gmp::log10.bigz(lrg_idx)
    #> [1] 24318.49
    
    ## And for r = 30000 we have:
    set.seed(123)
    really_large_random_comb = sort(sample(1e6, 3e4))
    system.time(really_lrg_idx <- comboRank(really_large_random_comb, v = 1e6))
    #>   user  system elapsed 
    #>  4.942   0.003   4.945 
    gmp::log10.bigz(really_lrg_idx)
    #> [1] 58514.98
    

    Under 5 seconds ain't that bad!

    We can use comboSample, which essentially “unranks” when we use the sampleVec argument, for confirmation:

    check_large_comb = comboSample(1e6, 1e4, sampleVec = lrg_idx)
    
    ## Sense comboSample returns a matrix, we must convert to a vector before we compare
    identical(as.vector(check_large_comb), large_random_comb)
    #> [1] TRUE
    

    What about Python?

    And if you need this in python, we can make use of rpy2. Here is a snippet from a Jupyter Notebook:

    #> Cell 0
    -------------------------------------------------------
    import rpy2
    import random
    from itertools import combinations
    
    mylist = range(0,35) 
    r = 4
    combinationslist = list(combinations(mylist, r))
    combo = random.choice(combinationslist)
    combo
    -------------------------------------------------------
    #> Out[25]: (1, 25, 30, 31)
    
    #> Cell 1
    -------------------------------------------------------
    ## Convert it to a list to ease the transition to R
    lst_combo = list(combo)
    -------------------------------------------------------
    
    #> Cell 2
    -------------------------------------------------------
    %load_ext rpy2.ipython
    -------------------------------------------------------
    
    #> Cell 3
    -------------------------------------------------------
    %%R -i lst_combo -o idx
    ​
    library(RcppAlgos)
    idx = comboRank(lst_combo, v = 0:34)
    -------------------------------------------------------
    
    #> Cell 4
    -------------------------------------------------------
    idx[0]
    -------------------------------------------------------
    #> Out[39]: 11347
    
    #> Cell 5
    -------------------------------------------------------
    ## R is base 1, so we subtract 1
    combinationslist[idx[0] - 1]
    -------------------------------------------------------
    #> Out[40]: (1, 25, 30, 31)
    

    Addendum - Key Idea in Ranking Algorithm

    Even if we were to translate the excellent algorithm outlined by @wim to a compiled language, we would still not be anywhere close to tackling the large cases presented here. That is because successive calls to any combinatorial function, no matter how optimized, are expensive.

    Instead, we take advantage of the fact that this algorithm relies on very subtle differences on each iteration. For example, what if we wanted to calculate the following 3 numbers:

    1. nCr(20, 15) = 15504
    2. nCr(19, 14) = 11628
    3. nCr(18, 13) = 8568

    Given the formula for nCr:

    n! / (r! * (n - r)!)

    With this we can use the result in 1 to get the result in step 2 with only two operations and we can use this result to get the result in step 3 in only two operations as well! Observe:

    (15504 * 15) / 20 = 11628
    (11628 * 14) / 19 = 8568
    

    This is the key idea behind most of the ranking/unranking algorithms in RcppAlgos.

    I'm not sure of an elegant way to get to the C++ code in RcppAlgos from python. Probably the best solution if you don't want to deal with rpy2 is to adapt the algorithms below to your personal needs: https://github.com/jwood000/RcppAlgos/blob/main/src/RankCombination.cpp