
Randomly generate all unique pair-wise combinations of elements from two lists in bounded time


I have two lists:

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]

and would like to generate all possible combinations, in random order, with a Python generator. I know I could do:

import itertools
import random

combinations = list(itertools.product(a, b))
random.shuffle(combinations)

But that approach has an extreme memory cost: I would have to hold all possible combinations in memory, even if I only wanted two random unique combinations.

My target is a Python generator whose memory cost grows with the number of iterations requested from it, only reaching the same O(|a|*|b|) memory cost as the itertools approach when fully exhausted.

I had this for now:

import random
from typing import List

def _unique_combinations(a: List, b: List):
    """
    Creates a generator that yields unique combinations of elements from a and b
    in the form of (a_element, b_element) tuples in a random order.
    """
    len_a, len_b = len(a), len(b)
    generated = set()
    for i in range(len_a):
        for j in range(len_b):
            while True:
                # choose random elements from a and b
                element_a = random.choice(a)
                element_b = random.choice(b)
                if (element_a, element_b) not in generated:
                    generated.add((element_a, element_b))
                    yield (element_a, element_b)
                    break

But it's flawed, as it can theoretically run forever if the random.choice calls are unlucky.

I'm looking to modify that existing generator so it picks the indexes randomly within a fixed amount of time per draw. It's okay to keep track of the indexes already produced, as that is a linear increase in memory cost, not an exponential one.

How could I modify that random index generator to be bounded in time?
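One standard way to get the bounded-time-per-draw behaviour described above is a lazy Fisher–Yates shuffle over the flat indexes 0..len(a)*len(b)-1, storing only the swaps performed so far in a dict. A sketch of that idea (the name lazy_fisher_yates and the exact implementation are illustrative, not taken from the answer below):

```python
import random

def lazy_fisher_yates(a, b):
    """Yield all (a_element, b_element) pairs in uniformly random order.

    Each draw costs O(1) time, and memory grows by at most one dict
    entry per yielded pair, so consuming only k pairs costs O(k) space.
    """
    len_b = len(b)
    n = len(a) * len_b
    swaps = {}  # sparse stand-in for a fully materialized index list
    for end in range(n - 1, -1, -1):
        j = random.randint(0, end)
        flat = swaps.get(j, j)          # value currently sitting at slot j
        swaps[j] = swaps.pop(end, end)  # move the last unused value into slot j
        yield a[flat // len_b], b[flat % len_b]
```

The benchmarks in the answer below include a fisher_yates entry, which may differ in detail from this sketch.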


Solution

  • We create a sequence using a prime number and one of its primitive roots modulo n that visits each number in an interval exactly once. More specifically we are looking for a generator of the multiplicative group of integers modulo n.

We have to pick our prime a little larger than the product len(a)*len(b), so we have to skip the states that would otherwise produce index errors.
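    As a quick illustration of why this works (a toy example, not part of the solution code): 3 is a primitive root modulo the prime 7, so repeated multiplication by 3 visits every residue in 1..6 exactly once before cycling:

```python
p, root = 7, 3  # a small prime and one of its primitive roots
state = root
seen = []
for _ in range(p - 1):
    seen.append(state)
    state = (state * root) % p
print(seen)          # [3, 2, 6, 4, 5, 1]
print(sorted(seen))  # [1, 2, 3, 4, 5, 6]
```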

    import random
    from math import gcd
    import math
    
    def next_prime(number):
        if number < 0:
            raise ValueError('Negative numbers can not be primes')
        if number <= 1:
            return 2
        if number % 2 == 0:
            number -= 1
        while True:
            number += 2
            max_check = int(math.sqrt(number)) + 2
            for divider in range(3, max_check, 2):
                if number % divider == 0:
                    break
            else:
                return number
    
    
    def is_primitive_root(a, n):
        phi = n - 1
        factors = set()
        for i in range(2, int(phi ** 0.5) + 1):
            if phi % i == 0:
                factors.add(i)
                factors.add(phi // i)
        for factor in factors:
            if pow(a, factor, n) == 1:
                return False
        return True
    
    
    def find_random_primitive_root(n):
        while True:
            a = random.randint(2, n - 1)
            if gcd(a, n) == 1 and is_primitive_root(a, n):
                return a
    
    
    def sampler(l):
        close_prime = next_prime(l)
        state = root = find_random_primitive_root(close_prime)
        while state > l:
            state = (state * root) % close_prime  # Inlining the computation leads to a 20% speed up
    
        yield state - 1
        for i in range(l - 1):
            state = (state * root) % close_prime
            while state > l:
                state = (state * root) % close_prime
            yield state - 1
    

    Then we use a mapping from 1D -> 2D to "translate" our sequence number into a tuple and yield the result.

    def _unique_combinations(a, b):
        cartesian_product_cardinality = len(a) * len(b)
        sequence = sampler(cartesian_product_cardinality)
        len_b = len(b) # Function calls are expensive in python and this line yields a 10% total speed up
        for state in sequence:
            yield a[state // len_b], b[state % len_b]
    
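    The 1D -> 2D translation is just integer division and modulo; a tiny illustration on a 4-wide grid (the values here are chosen arbitrarily):

```python
len_b = 4  # grid width, i.e. len(b) in the solution above
for state in (0, 1, 4, 15):
    row, col = divmod(state, len_b)  # same as (state // len_b, state % len_b)
    print(state, (row, col))
# prints: 0 (0, 0) | 1 (0, 1) | 4 (1, 0) | 15 (3, 3)
```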
    
    from itertools import product
    
    a = [1, 2, 3, 5]
    b = ["a", "b", "c", "d"]
    u = _unique_combinations(a, b)
    
    assert sorted(u) == sorted(product(a, b))
    

    I started benchmarking the various approaches. For merging two lists of length 1000, the divmod solution by @gog already underperforms terribly, so I'm going to exclude it from further testing:

    kelly took 0.9156949520111084 seconds
    divmod took 41.20149779319763 seconds
    prime_roots took 0.5146901607513428 seconds
    samwise took 0.698538064956665 seconds
    fisher_yates took 0.902874231338501 seconds
    

    For the remaining algorithms I benchmarked the following

    import pandas as pd
    import timeit

    # `approaches` maps a label to each candidate function, e.g.:
    # approaches = {'prime_roots': _unique_combinations, ...}

    # Define the list lengths to benchmark
    list_lengths = [10, 20, 30, 100, 300, 500, 1000, 1500, 2000, 3000, 5000]

    num_repetitions = 2

    rows = []
    for approach, function in approaches.items():
        for length in list_lengths:
            a = list(range(length))
            b = list(range(length))

            execution_time = timeit.timeit(lambda: list(function(a, b)), number=num_repetitions)

            rows.append({
                'Approach': approach,
                'List Length': length,
                'Execution Time': execution_time
            })

    # DataFrame.append was removed in pandas 2.0; build the frame in one go instead
    results_df = pd.DataFrame(rows, columns=['Approach', 'List Length', 'Execution Time'])
    

    [benchmark plot: execution time vs. list length for each approach]

    All in all, I think all of the approaches are somewhat similar. Time-complexity-wise, every tested approach falls in O(|a|*|b|).

    Memory-wise the prime roots approach wins, simply because all the other approaches need to keep track of O(|a|*|b|) elements, whereas the prime roots approach doesn't require that.
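    To make the memory comparison concrete, a rough sketch using tracemalloc (plain itertools.product stands in here for any constant-state lazy generator; the materialize-and-shuffle peak dominates either way):

```python
import random
import tracemalloc
from itertools import product

a = list(range(300))
b = list(range(300))

# Materialize-and-shuffle: all |a|*|b| tuples live in memory at once.
tracemalloc.start()
combos = list(product(a, b))
random.shuffle(combos)
_, peak_eager = tracemalloc.get_traced_memory()
tracemalloc.stop()
del combos

# A constant-state generator (plain product here, standing in for the
# prime-roots sampler) never materializes the pairs.
tracemalloc.start()
for _ in product(a, b):
    pass
_, peak_lazy = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(peak_eager > peak_lazy)  # True on CPython
```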

    Distribution-wise the prime roots approach is absolutely the worst, because it's not actually random but rather a difficult-to-predict deterministic sequence. In practice the sequences should be "random" enough.

    Credit to this Stack Overflow answer, which inspired the solution.