Given itertools combinations with an r of 4:
from itertools import combinations
mylist = range(0,35)
r = 4
combinationslist = list(combinations(mylist, r))
Which will output:
(0, 1, 2, 3)
(0, 1, 2, 4)
(0, 1, 2, 5)
(0, 1, 2, 6)
(0, 1, 2, 7)
(0, 1, 2, 8)
(0, 1, 2, 9)
...
(30, 31, 32, 33)
(30, 31, 32, 34)
(30, 31, 33, 34)
(30, 32, 33, 34)
(31, 32, 33, 34)
My question is if we were to chunk to the list into blocks of 10 can we find what nth a combination is within those blocks, but without generating all combinations. Or in another words if the position is divisible by x.
One of the problems with this is the positions will get into the billions of billions and might not be possible to derive what the nth is. Is there a heuristic that can regardless find whether a particular combination/sequence of elements is divisible by x
Edit/addition: The reasoning for this question is for situations where the list is range(0,1000000) and r =30000 for example. Then provided a combination, find if it's divisible by x. Naturally the actual index will be ridiculously enormous (and the full combinations too much to generate)
I have authored a package in R
called RcppAlgos
that has functions specifically for this task.
Use comboRank
from RcppAlgos
.
In the article that @wim linked to, you will see that this procedure is often called ranking and as many have pointed out, this boils down to counting.
In the RcppAlgos
package there are several ranking functions for ranking various structures (e.g. partitionsRank
for ranking integer partitions). We will use comboRank
for the task at hand:
library(RcppAlgos)
## Generate random combination from 1:35 of length 4
set.seed(42)
small_random_comb = sort(sample(35, 4))
## Print the combination
small_random_comb
#> [1] 1 4 10 25
## Ranking the combination with comboRank (See ?comboRank for more info).
## N.B. for the ranking functions, we must provide the source vector to rank appropriately
idx = comboRank(small_random_comb, v = 35)
## Remember, R is base 1.
idx
#> [1] 1179
## Generate all combinations to confirm
all_combs = comboGeneral(35, 4)
## Same result
all_combs[idx, ]
#> [1] 1 4 10 25
The functions are very efficient as well. They are written in C++
and use the gmp
library for handling large numbers.
Are they efficient enough for the very large case n = 1000000
and r = 10000
(or even r = 30000
)?
set.seed(97)
large_random_comb = sort(sample(1e6, 1e4))
head(large_random_comb)
#> [1] 76 104 173 608 661 828
tail(large_random_comb)
#> [1] 999684 999731 999732 999759 999824 999878
system.time(lrg_idx <- comboRank(large_random_comb, v = 1e6))
#> user system elapsed
#> 2.036 0.003 2.039
## Let’s not print this number as it is over 20,000 digits
gmp::log10.bigz(lrg_idx)
#> [1] 24318.49
## And for r = 30000 we have:
set.seed(123)
really_large_random_comb = sort(sample(1e6, 3e4))
system.time(really_lrg_idx <- comboRank(really_large_random_comb, v = 1e6))
#> user system elapsed
#> 4.942 0.003 4.945
gmp::log10.bigz(really_lrg_idx)
#> [1] 58514.98
Under 5 seconds ain't that bad!
We can use comboSample
, which essentially “unranks” when we use the sampleVec
argument, for confirmation:
check_large_comb = comboSample(1e6, 1e4, sampleVec = lrg_idx)
## Sense comboSample returns a matrix, we must convert to a vector before we compare
identical(as.vector(check_large_comb), large_random_comb)
#> [1] TRUE
And if you need this in python
, we can make use of rpy2
. Here is a snippet from a Jupyter Notebook:
#> Cell 0
-------------------------------------------------------
import rpy2
import random
from itertools import combinations
mylist = range(0,35)
r = 4
combinationslist = list(combinations(mylist, r))
combo = random.choice(combinationslist)
combo
-------------------------------------------------------
#> Out[25]: (1, 25, 30, 31)
#> Cell 1
-------------------------------------------------------
## Convert it to a list to ease the transition to R
lst_combo = list(combo)
-------------------------------------------------------
#> Cell 2
-------------------------------------------------------
%load_ext rpy2.ipython
-------------------------------------------------------
#> Cell 3
-------------------------------------------------------
%%R -i lst_combo -o idx
library(RcppAlgos)
idx = comboRank(lst_combo, v = 0:34)
-------------------------------------------------------
#> Cell 4
-------------------------------------------------------
idx[0]
-------------------------------------------------------
#> Out[39]: 11347
#> Cell 5
-------------------------------------------------------
## R is base 1, so we subtract 1
combinationslist[idx[0] - 1]
-------------------------------------------------------
#> Out[40]: (1, 25, 30, 31)
Even if we were to translate the excellent algorithm outlined by @wim to a compiled language, we would still not be anywhere close to tackling the large cases presented here. That is because successive calls to any combinatorial function, no matter how optimized, are expensive.
Instead, we take advantage of the fact that this algorithm relies on very subtle differences on each iteration. For example, what if we wanted to calculate the following 3 numbers:
Given the formula for nCr:
n! / (r! * (n - r)!)
With this we can use the result in 1 to get the result in step 2 with only two operations and we can use this result to get the result in step 3 in only two operations as well! Observe:
(15504 * 15) / 20 = 11628
(11628 * 14) / 19 = 8568
This is the key idea behind most of the ranking/unranking algorithms in RcppAlgos
.
I'm not sure of an elegant way to get to the C++
code in RcppAlgos
from python
. Probably the best solution if you don't want to deal with rpy2
is to adapt the algorithms below to your personal needs:
https://github.com/jwood000/RcppAlgos/blob/main/src/RankCombination.cpp