
Randomly generate all unique pair-wise combinations of elements from two lists in bounded time


I have two lists:

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]

and would like to generate all possible combinations, in random order, with a Python generator. I know I could do:

import itertools
import random

combinations = list(itertools.product(a, b))
random.shuffle(combinations)

But that approach has an extreme memory cost: I would have to hold all possible combinations in memory, even if I only wanted two random unique combinations.

My target is a Python generator whose memory cost grows with the number of iterations requested from it, only reaching the same O(|a|*|b|) memory cost as the itertools approach when fully exhausted.

I had this for now:

import random
from typing import List

def _unique_combinations(a: List, b: List):
    """
    Creates a generator that yields unique combinations of elements from a and b
    in the form of (a_element, b_element) tuples in a random order.
    """
    len_a, len_b = len(a), len(b)
    generated = set()
    for i in range(len_a):
        for j in range(len_b):
            while True:
                # choose random elements from a and b
                element_a = random.choice(a)
                element_b = random.choice(b)
                if (element_a, element_b) not in generated:
                    generated.add((element_a, element_b))
                    yield (element_a, element_b)
                    break

But it's flawed, as it can theoretically run forever if the random.choice calls are unlucky.

I'm looking to modify that existing generator so it picks the indexes randomly within a fixed amount of time per draw. It's okay to keep track of the indexes already produced, as that is a linear increase in memory cost, not an exponential one.

How could I modify that random index generator to be bounded in time?
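One standard way to get the bounded-time-per-draw behaviour described above is a lazy Fisher–Yates shuffle over the flat indexes 0..len(a)*len(b)-1, storing only the swaps performed so far in a dict. A sketch of that idea (the name lazy_fisher_yates and the exact implementation are illustrative, not taken from the answer below):

```python
import random

def lazy_fisher_yates(a, b):
    """Yield all (a_element, b_element) pairs in uniformly random order.

    Each draw costs O(1) time, and memory grows by at most one dict
    entry per yielded pair, so consuming only k pairs costs O(k) space.
    """
    len_b = len(b)
    n = len(a) * len_b
    swaps = {}  # sparse stand-in for a fully materialized index list
    for end in range(n - 1, -1, -1):
        j = random.randint(0, end)
        flat = swaps.get(j, j)          # value currently sitting at slot j
        swaps[j] = swaps.pop(end, end)  # move the last unused value into slot j
        yield a[flat // len_b], b[flat % len_b]
```

The benchmarks in the answer below include a fisher_yates entry, which may differ in detail from this sketch.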


Solution

  • We create a sequence using a prime number and one of its primitive roots modulo n that visits each number in an interval exactly once. More specifically we are looking for a generator of the multiplicative group of integers modulo n.

We have to pick our prime a little larger than the product len(a)*len(b), so we have to skip the states that would otherwise produce index errors.
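    As a quick illustration of why this works (a toy example, not part of the solution code): 3 is a primitive root modulo the prime 7, so repeated multiplication by 3 visits every residue in 1..6 exactly once before cycling:

```python
p, root = 7, 3  # a small prime and one of its primitive roots
state = root
seen = []
for _ in range(p - 1):
    seen.append(state)
    state = (state * root) % p
print(seen)          # [3, 2, 6, 4, 5, 1]
print(sorted(seen))  # [1, 2, 3, 4, 5, 6]
```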

    import random
    from math import gcd
    import math
    
    def next_prime(number):
        if number < 0:
            raise ValueError('Negative numbers can not be primes')
        if number <= 1:
            return 2
        if number % 2 == 0:
            number -= 1
        while True:
            number += 2
            max_check = int(math.sqrt(number)) + 2
            for divider in range(3, max_check, 2):
                if number % divider == 0:
                    break
            else:
                return number
    
    
    def is_primitive_root(a, n):
        phi = n - 1
        factors = set()
        for i in range(2, int(phi ** 0.5) + 1):
            if phi % i == 0:
                factors.add(i)
                factors.add(phi // i)
        for factor in factors:
            if pow(a, factor, n) == 1:
                return False
        return True
    
    
    def find_random_primitive_root(n):
        while True:
            a = random.randint(2, n - 1)
            if gcd(a, n) == 1 and is_primitive_root(a, n):
                return a
    
    
    def sampler(l):
        close_prime = next_prime(l)
        state = root = find_random_primitive_root(close_prime)
        while state > l:
            state = (state * root) % close_prime  # Inlining the computation leads to a 20% speed up
    
        yield state - 1
        for i in range(l - 1):
            state = (state * root) % close_prime
            while state > l:
                state = (state * root) % close_prime
            yield state - 1
    

    Then we use a mapping from 1D -> 2D to "translate" our sequence number into a tuple and yield the result.

    def _unique_combinations(a, b):
        cartesian_product_cardinality = len(a) * len(b)
        sequence = sampler(cartesian_product_cardinality)
        len_b = len(b) # Function calls are expensive in python and this line yields a 10% total speed up
        for state in sequence:
            yield a[state // len_b], b[state % len_b]
    
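    The 1D -> 2D translation is just integer division and modulo; a tiny illustration on a 4-wide grid (the values here are chosen arbitrarily):

```python
len_b = 4  # grid width, i.e. len(b) in the solution above
for state in (0, 1, 4, 15):
    row, col = divmod(state, len_b)  # same as (state // len_b, state % len_b)
    print(state, (row, col))
# prints: 0 (0, 0) | 1 (0, 1) | 4 (1, 0) | 15 (3, 3)
```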
    
    from itertools import product
    
    a = [1, 2, 3, 5]
    b = ["a", "b", "c", "d"]
    u = _unique_combinations(a, b)
    
    assert sorted(u) == sorted(product(a, b))
    

    I started benchmarking the various approaches. For merging two lists of length 1000, the divmod solution by @gog already underperforms terribly, so I'm going to exclude it from further testing:

    kelly took 0.9156949520111084 seconds
    divmod took 41.20149779319763 seconds
    prime_roots took 0.5146901607513428 seconds
    samwise took 0.698538064956665 seconds
    fisher_yates took 0.902874231338501 seconds
    

    For the remaining algorithms I benchmarked the following

    import pandas as pd
    import timeit

    # `approaches` maps a label to each candidate function, e.g.:
    # approaches = {'prime_roots': _unique_combinations, ...}

    # Define the list lengths to benchmark
    list_lengths = [10, 20, 30, 100, 300, 500, 1000, 1500, 2000, 3000, 5000]

    num_repetitions = 2

    rows = []
    for approach, function in approaches.items():
        for length in list_lengths:
            a = list(range(length))
            b = list(range(length))

            execution_time = timeit.timeit(lambda: list(function(a, b)), number=num_repetitions)

            rows.append({
                'Approach': approach,
                'List Length': length,
                'Execution Time': execution_time
            })

    # DataFrame.append was removed in pandas 2.0; build the frame in one go instead
    results_df = pd.DataFrame(rows, columns=['Approach', 'List Length', 'Execution Time'])
    

    [benchmark plot: execution time vs. list length for each approach]

    All in all, I think all of the approaches are somewhat similar. Time-complexity-wise, every tested approach falls in O(|a|*|b|).

    Memory-wise the prime roots approach wins, simply because all the other approaches need to keep track of O(|a|*|b|) elements, whereas the prime roots approach doesn't require that.
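    To make the memory comparison concrete, a rough sketch using tracemalloc (plain itertools.product stands in here for any constant-state lazy generator; the materialize-and-shuffle peak dominates either way):

```python
import random
import tracemalloc
from itertools import product

a = list(range(300))
b = list(range(300))

# Materialize-and-shuffle: all |a|*|b| tuples live in memory at once.
tracemalloc.start()
combos = list(product(a, b))
random.shuffle(combos)
_, peak_eager = tracemalloc.get_traced_memory()
tracemalloc.stop()
del combos

# A constant-state generator (plain product here, standing in for the
# prime-roots sampler) never materializes the pairs.
tracemalloc.start()
for _ in product(a, b):
    pass
_, peak_lazy = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(peak_eager > peak_lazy)  # True on CPython
```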

    Distribution-wise the prime roots approach is absolutely the worst, because it's not actually random but rather a difficult-to-predict deterministic sequence. In practice the sequences should be "random" enough.

    Credit to this Stack Overflow answer, which inspired the solution.