Tags: python, arrays, numpy, generator, python-itertools

Build a large numpy array from itertools.product


I want to build a numpy array from the result of itertools.product. My first approach was simple:

from itertools import product
import numpy as np

max_init = 6
init_values = range(1, max_init + 1)
repetitions = 12

result = np.array(list(product(init_values, repeat=repetitions)))

This code works well for "small" repetitions (like <=4), but with "large" values (>=12) it completely hogs the memory and crashes. I assumed that building the list was what was eating all the RAM, so I searched for how to build the array directly. I found Numpy equivalent of itertools.product and Using numpy to build an array of all combinations of two arrays.

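A quick back-of-envelope check of why the full array is hopeless (assuming numpy's default 8-byte integer dtype, which is what most 64-bit platforms use):

rows = 6 ** 12   # 2176782336 tuples
cols = 12        # elements per tuple
print(rows * cols * 8)  # 208971104256 bytes, roughly 209 GB
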
So, I tested the following alternatives:

Alternative #1:

results = np.empty((max_init**repetitions, repetitions))
for i, row in enumerate(product(init_values, repeat=repetitions)):
    results[i] = row  # one assignment per tuple; this Python-level loop is slow

Alternative #2:

init_values_args = [init_values] * repetitions
results = np.array(np.meshgrid(*init_values_args)).T.reshape(-1, repetitions)

Alternative #3:

results = np.indices([max_init] * repetitions).reshape(repetitions, -1).T + 1

#1 is extremely slow. I didn't have enough patience to let it finish (I gave up after a few minutes of processing on a 2017 MacBook Pro). #2 and #3 eat all the memory until the Python interpreter crashes, as with the initial approach.

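A possible mitigation, if each chunk of rows can be processed and then discarded rather than held all at once, is to consume the generator in bounded-memory blocks. A minimal sketch (the helper name is just illustrative):

from itertools import islice, product
import numpy as np

def product_chunks(values, repeat, chunk_rows=1_000_000):
    # Yield the Cartesian product as fixed-size numpy blocks so that
    # only one chunk lives in memory at a time.
    it = product(values, repeat=repeat)
    while True:
        block = list(islice(it, chunk_rows))
        if not block:
            return
        yield np.array(block, dtype=np.int8)  # values 1..6 fit in int8

Each block can be counted, filtered, or summed before the next one is materialized, so peak memory stays bounded even for repetitions=12 (though the total running time is still proportional to 6**12).
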
After that, I thought that I could express the same information in a different way that was still useful for me: a dict where the keys would be all the possible (sorted) combinations, and the values would be the counts of those combinations. So I tried:

Alternative #4:

from collections import Counter

def sorted_product(iterable, repeat=1):
    for el in product(iterable, repeat=repeat):
        yield tuple(sorted(el))

def count_product(repetitions=1, max_init=6):
    init_values = range(1, max_init + 1)
    sp = sorted_product(init_values, repeat=repetitions)
    counted_sp = Counter(sp)
    return np.array(list(counted_sp.values())), \
        np.array(list(counted_sp.keys()))

cnt, values = count_product(repetitions=repetitions, max_init=max_init)

But the line counted_sp = Counter(sp), which consumes the entire generator, is also too slow for my needs (it ran for several minutes before I canceled it).

Is there another way to generate the same data (or a different data structure containing the same information) that does not have the mentioned shortcomings of being too slow or using too much memory?

PS: I tested all the implementations above with small values of repetitions, and all the tests passed, so they give consistent results.


I hope that editing the question is the best way to expand it. Otherwise, let me know and I'll post it where appropriate.

After reading the first two answers below and thinking about it, I agree that I am approaching the issue from the wrong angle. Instead of going with a "brute force" approach, I should use probabilities and work with those.

My intention is, later on, for each combination:

- Count how many values are under a threshold X.
- Count how many values are equal to or over threshold X and below a threshold Y.
- Count how many values are over threshold Y.

And then group the combinations that have the same counts.

As an illustrative example: If I roll 12 dice of 6 sides, what's the probability of having M dice with a value <3, N dice with a value >=3 and <4, and P dice with a value >5, for all possible combinations of M, N, and P?

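For instance, when the two thresholds partition the six faces into three buckets (say {1, 2}, {3, 4, 5}, and {6}, a partition chosen here just for illustration and not matching the exact thresholds above), the counts (M, N, P) follow a multinomial distribution, so each probability can be computed directly instead of enumerated. A minimal sketch:

from math import factorial

def bucket_probability(m, n, p, probs=(2/6, 3/6, 1/6)):
    # Multinomial probability that exactly m, n and p of the m+n+p dice
    # land in the first, second and third bucket respectively.
    total = m + n + p
    coeff = factorial(total) // (factorial(m) * factorial(n) * factorial(p))
    return coeff * probs[0]**m * probs[1]**n * probs[2]**p

print(bucket_probability(12, 0, 0))  # (2/6)**12, about 1.88e-06
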
So, I think that I'll close this question in a few days while I go with this new approach. Thank you for all the feedback and your time!


Solution

  • The number of tuples that list(product(range(1,7), repeat=12)) makes is 6**12 = 2,176,782,336. Whether as a list or an array, that's probably too large for most computers.

    In [119]: len(list(product(range(1,7),repeat=12)))
    ....
    MemoryError:
    

    Trying to make an array of that size directly:

    In [129]: A = np.ones((6**12,12),int)
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-129-e833a9e859e0> in <module>()
    ----> 1 A = np.ones((6**12,12),int)
    
    /usr/local/lib/python3.5/dist-packages/numpy/core/numeric.py in ones(shape, dtype, order)
        190 
        191     """
    --> 192     a = empty(shape, dtype, order)
        193     multiarray.copyto(a, 1, casting='unsafe')
        194     return a
    
    ValueError: Maximum allowed dimension exceeded
    

    Array memory size, at 4 bytes per item:

    In [130]: 4*12*6**12
    Out[130]: 104485552128
    

    100GB?

    Why do you need to generate over 2 billion combinations of 12 numbers?


    So with your Counter you reduce the number of items:

    In [138]: sp = sorted_product(range(1,7), 2)
    In [139]: counter=Counter(sp)
    In [140]: counter
    Out[140]: 
    Counter({(1, 1): 1,
             (1, 2): 2,
             (1, 3): 2,
             (1, 4): 2,
             (1, 5): 2,
             (1, 6): 2,
             (2, 2): 1,
             (2, 3): 2,
             (2, 4): 2,
             (2, 5): 2,
             (2, 6): 2,
             (3, 3): 1,
             (3, 4): 2,
             (3, 5): 2,
             (3, 6): 2,
             (4, 4): 1,
             (4, 5): 2,
             (4, 6): 2,
             (5, 5): 1,
             (5, 6): 2,
             (6, 6): 1})
    

    from 36 to 21 items (for 2 repetitions). It shouldn't be hard to generalize this to more repetitions (via combinations with replacement; see the sketch below). It will still push time and/or memory boundaries.

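    One way to flesh out that generalization (a sketch, not part of the original answer): itertools.combinations_with_replacement yields exactly the sorted tuples, and each tuple's count in the full product is the number of distinct orderings of its elements, a multinomial coefficient:

        from collections import Counter
        from itertools import combinations_with_replacement
        from math import factorial

        def counted_sorted_product(iterable, repeat=1):
            # Builds the same mapping as Counter(sorted_product(...)) without
            # visiting all len(iterable)**repeat tuples: each sorted tuple
            # occurs repeat! / (m1! * m2! * ...) times in the full product,
            # with m_i the multiplicities of its distinct elements.
            counts = {}
            for combo in combinations_with_replacement(iterable, repeat):
                perms = factorial(repeat)
                for mult in Counter(combo).values():
                    perms //= factorial(mult)
                counts[combo] = perms
            return counts

    For 6 faces and repeat=12 this visits only C(6+12-1, 12) = 6188 tuples instead of 6**12 (about 2.2 billion).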

    A variant on meshgrid using mgrid:

    In [175]: n=7; A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T
    In [176]: A.shape
    Out[176]: (279936, 7)
    In [177]: B=np.array(list(product(range(1,7),repeat=7)))
    In [178]: B.shape
    Out[178]: (279936, 7)
    In [179]: A[:10]
    Out[179]: 
    array([[1, 1, 1, 1, 1, 1, 1],
           [1, 1, 1, 1, 1, 1, 2],
           [1, 1, 1, 1, 1, 1, 3],
           [1, 1, 1, 1, 1, 1, 4],
           [1, 1, 1, 1, 1, 1, 5],
           [1, 1, 1, 1, 1, 1, 6],
           [1, 1, 1, 1, 1, 2, 1],
           [1, 1, 1, 1, 1, 2, 2],
           [1, 1, 1, 1, 1, 2, 3],
           [1, 1, 1, 1, 1, 2, 4]])
    In [180]: B[:10]
    Out[180]: 
    array([[1, 1, 1, 1, 1, 1, 1],
           [1, 1, 1, 1, 1, 1, 2],
           [1, 1, 1, 1, 1, 1, 3],
           [1, 1, 1, 1, 1, 1, 4],
           [1, 1, 1, 1, 1, 1, 5],
           [1, 1, 1, 1, 1, 1, 6],
           [1, 1, 1, 1, 1, 2, 1],
           [1, 1, 1, 1, 1, 2, 2],
           [1, 1, 1, 1, 1, 2, 3],
           [1, 1, 1, 1, 1, 2, 4]])
    In [181]: np.allclose(A,B)
    Out[181]: True
    

    mgrid is quite a bit faster:

    In [182]: timeit B=np.array(list(product(range(1,7),repeat=7)))
    317 ms ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    In [183]: timeit A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T
    13.9 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    but, yes, it will have the same overall memory usage and limit.

    With n=10,

    ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
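
    Even shrinking the itemsize doesn't rescue the repeat=12 case; at one byte per element (np.int8), the full table still needs about 26 GB:

        import numpy as np

        # 1 byte per element: still far too large for typical RAM.
        print(np.dtype(np.int8).itemsize * 12 * 6**12)  # 26121388032 bytes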