Search code examples
pythonperformancerandomcombinations

Generating a random (equal probability) combination with replacement


I want to generate one random combination out of all possible combinations_with_replacement. The tricky bit is that I want each of the possible outcomes to have the same probability without needing to generate (not even implicit) all the possible outcomes.

For example:

import itertools
import random

random.choice(list(itertools.combinations_with_replacement(range(4), 2)))

That approach is way too slow (and memory expensive) because it needs to create all possible combinations whereas I only want one.

It's not so bad if I determine how many combinations_with_replacement there will be and use random.randrange together with next and itertools.islice on the itertools.combinations_with_replacement. That doesn't need to generate all possible combinations (except in the worst-case). But it's still too slow.

On the other hand the recipe mentioned in the itertools documentation is fast but not each combination has the same probability.


Solution

  • Well, I'm in a bit of a dilemma, because I've found an algorithm that works, but I don't know why. So do what you want of if, maybe some mathematician in the room can work out the probabilities, but it does empirically work. The idea is to pick one element at a time, increasing the probability of the selected elements. I suspect the reasoning must be similar to that of reservoir sampling, but I didn't work it out.

    from random import choice
    from itertools import combinations_with_replacement
    
    population = ["A", "B", "C", "D"]
    k = 3
    
    def random_comb(population, k):
        idx = []
        indices = list(range(len(population)))
        for _ in range(k):
            idx.append(choice(indices))
            indices.append(idx[-1])
        return tuple(population[i] for i in sorted(idx))
    
    combs = list(combinations_with_replacement(population, k))
    counts = {c: 0 for c in combs}
    
    for _ in range(100000):
        counts[random_comb(population, k)] += 1
    
    for comb, count in sorted(counts.items()):
        print("".join(comb), count)
    

    The output is the number of times each possibility has appeared after 100,000 runs:

    AAA 4913
    AAB 4917
    AAC 5132
    AAD 4966
    ABB 5027
    ABC 4956
    ABD 4959
    ACC 5022
    ACD 5088
    ADD 4985
    BBB 5060
    BBC 5070
    BBD 5056
    BCC 4897
    BCD 5049
    BDD 5059
    CCC 5024
    CCD 5032
    CDD 4859
    DDD 4929