I have two lists:
a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]
And I would like to generate all possible combinations, in a random order, with a Python generator. I know I could do:
import itertools
import random

combinations = list(itertools.product(a, b))
random.shuffle(combinations)
But that one has an extreme memory cost, as I would have to hold all possible combinations in memory, even if I only wanted two random unique combinations.
My target is a Python generator whose memory cost grows with the number of iterations requested from it, only reaching the same memory cost as the itertools approach once all combinations have been consumed.
I had this for now:
import random
from typing import List

def _unique_combinations(a: List, b: List):
    """
    Creates a generator that yields unique combinations of elements from a and b
    in the form of (a_element, b_element) tuples in a random order.
    """
    len_a, len_b = len(a), len(b)
    generated = set()
    for i in range(len_a):
        for j in range(len_b):
            while True:
                # choose random elements from a and b
                element_a = random.choice(a)
                element_b = random.choice(b)
                if (element_a, element_b) not in generated:
                    generated.add((element_a, element_b))
                    yield (element_a, element_b)
                    break
But it's flawed, as it can theoretically run forever if the random.choice lines are unlucky.
I'm looking to modify that existing generator so it generates the indexes randomly within a fixed amount of time. It's okay to keep track of the indexes already produced, as that is a linear increase in memory cost with the number of iterations, not the full O(|a|*|b|) cost upfront.
How could I modify that random index generator to be bounded in time?
We create a sequence using a prime number and one of its primitive roots modulo n that visits each number in an interval exactly once. More specifically, we are looking for a generator of the multiplicative group of integers modulo n.
We have to pick our prime number a little larger than the product len(a)*len(b), so we have to account for the cases in which we'd get index errors.
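For intuition, here is a minimal sketch of the property this relies on, with p = 7 and g = 3 chosen purely for illustration: the powers of a primitive root g modulo a prime p visit every number in 1..p-1 exactly once before cycling.
p, g = 7, 3  # illustrative values; 3 is a primitive root modulo 7
powers = [pow(g, k, p) for k in range(1, p)]
print(powers)                               # [3, 2, 6, 4, 5, 1]
assert sorted(powers) == list(range(1, p))
The full implementation: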
import random
from math import gcd
import math

def next_prime(number):
    """Return the smallest prime strictly greater than number (for number >= 2)."""
    if number < 0:
        raise ValueError('Negative numbers can not be primes')
    if number <= 1:
        return 2
    if number % 2 == 0:
        number -= 1
    while True:
        number += 2  # only test odd candidates
        max_check = int(math.sqrt(number)) + 2
        for divider in range(3, max_check, 2):
            if number % divider == 0:
                break
        else:
            return number

def is_primitive_root(a, n):
    """Check whether a generates the multiplicative group modulo the prime n."""
    phi = n - 1
    factors = set()
    for i in range(2, int(phi ** 0.5) + 1):
        if phi % i == 0:
            factors.add(i)
            factors.add(phi // i)
    # a is a primitive root iff a^d != 1 (mod n) for every proper divisor d > 1 of phi
    for factor in factors:
        if pow(a, factor, n) == 1:
            return False
    return True

def find_random_primitive_root(n):
    while True:
        a = random.randint(2, n - 1)
        if gcd(a, n) == 1 and is_primitive_root(a, n):
            return a

def sampler(l):
    """Yield each index in 0..l-1 exactly once, in a scrambled order."""
    close_prime = next_prime(l)
    state = root = find_random_primitive_root(close_prime)
    while state > l:  # skip states that fall outside our interval
        state = (state * root) % close_prime  # Inlining the computation leads to a 20% speed up
    yield state - 1
    for i in range(l - 1):
        state = (state * root) % close_prime
        while state > l:
            state = (state * root) % close_prime
        yield state - 1
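As a quick sanity check (an addition for this write-up, not part of the original code), sampler(l) should emit every index in 0..l-1 exactly once:
indices = list(sampler(10))
print(indices)  # order varies with the randomly chosen root, e.g. [5, 3, 1, 8, 0, 4, 6, 9, 2, 7]
assert sorted(indices) == list(range(10))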
Then we use a mapping from 1D -> 2D to "translate" our sequence number into a tuple and yield the result.
def _unique_combinations(a, b):
    cartesian_product_cardinality = len(a) * len(b)
    sequence = sampler(cartesian_product_cardinality)
    len_b = len(b)  # Function calls are expensive in python and this line yields a 10% total speed up
    for state in sequence:
        yield a[state // len_b], b[state % len_b]
from itertools import product
a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]
u = _unique_combinations(a, b)
assert sorted(u) == sorted(product(a, b))
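And it delivers the lazy behaviour asked for in the question; a small sketch using itertools.islice (not part of the answer above) to pull just two random unique combinations without ever materializing the full product:
from itertools import islice

# Only two elements are ever generated; memory grows with the number of
# iterations requested, not with len(a) * len(b).
first_two = list(islice(_unique_combinations(a, b), 2))
print(first_two)  # e.g. [(3, 'b'), (1, 'd')]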
I started benchmarking the various approaches. For merging two lists of length 1000 each, the divmod solution by @gog already underperforms terribly, so I'm going to exclude it from further testing:
kelly took 0.9156949520111084 seconds
divmod took 41.20149779319763 seconds
prime_roots took 0.5146901607513428 seconds
samwise took 0.698538064956665 seconds
fisher_yates took 0.902874231338501 seconds
For the remaining algorithms, I benchmarked the following:
import pandas as pd
import timeit
import random
from itertools import combinations
from math import gcd

# Define the list lengths to benchmark
list_lengths = [10, 20, 30, 100, 300, 500, 1000, 1500, 2000, 3000, 5000]
num_repetitions = 2

# `approaches` maps each approach's name to its generator function (definitions omitted here)
rows = []
for approach, function in approaches.items():
    for length in list_lengths:
        a = list(range(length))
        b = list(range(length))
        execution_time = timeit.timeit(lambda: list(function(a, b)), number=num_repetitions)
        rows.append({
            'Approach': approach,
            'List Length': length,
            'Execution Time': execution_time
        })
# DataFrame.append was removed in pandas 2.0; building from a list of rows is the idiomatic replacement
results_df = pd.DataFrame(rows, columns=['Approach', 'List Length', 'Execution Time'])
All in all, I think all of the approaches are somewhat similar: every tested approach falls in O(|a|*|b|) time-complexity wise.
Memory-wise the prime roots approach wins, simply because all other approaches need to keep track of O(|a|*|b|) elements, whereas the prime roots approach doesn't require that.
Distribution-wise the prime roots approach is the worst, because it's not actually random but rather a difficult-to-predict deterministic sequence. In practice, the sequences should be "random" enough.
Credit to this Stack Overflow answer, which inspired the solution.