I want to build a numpy
array from the result of itertools.product
. My first approach was a simple:
from itertools import product
import numpy as np
max_init = 6
init_values = range(1, max_init + 1)
repetitions = 12
result = np.array(list(product(init_values, repeat=repetitions)))
This code works well for "small" repetitions
(like <=4), but with "large" values (>= 12) it completely hogs the memory and crashes. I assumed that building the list was the thing eating all the RAM, so I searched how to make it directly with an array. I found Numpy equivalent of itertools.product and Using numpy to build an array of all combinations of two arrays.
So, I tested the following alternatives:
Alternative #1:
results = np.empty((max_init**repetitions, repetitions))
for i, row in enumerate(product(init_values, repeat=repetitions)):
result[:][i] = row
Alternative #2:
init_values_args = [init_values] * repetitions
results = np.array(np.meshgrid(*init_values_args)).T.reshape(-1, repetitions)
Alternative #3:
results = np.indices([sides] * num_dice).reshape(num_dice, -1).T + 1
#1 is extremely slow. I didn't have enough patience to let it finish (after a few minutes of processing on a 2017 MacBook Pro). #2 and #3 eat all the memory until the python interpreter crashes, as with the initial approach.
After that, I thought that I could express the same information in a different way that was still useful for me: a dict
where the keys would be all the possible (sorted) combinations, and the values would be the counting of these combinations. So I tried:
Alternative #4:
from collections import Counter
def sorted_product(iterable, repeat=1):
for el in product(iterable, repeat=repeat):
yield tuple(sorted(el))
def count_product(repetitions=1, max_init=6):
init_values = range(1, max_init + 1)
sp = sorted_product(init_values, repeat=repetitions)
counted_sp = Counter(sp)
return np.array(list(counted_sp.values())), \
np.array(list(counted_sp.keys()))
cnt, values = count(repetitions=repetitions, max_init=max_init)
But the line counted_sp = Counter(sp)
, which triggers getting all the values of the generators, is also too slow for my needs (it also took several minutes before I canceled it).
Is there another way to generate the same data (or a different data structure containing the same information) that does not have the mentioned shortcomings of being too slow or using too much memory?
PS: I tested all the implementations above against my tests with a small repetitions
, and all the tests passed, so they give consistent results.
I hope that editing the question is the best way to expand it. Otherwise, let me know, and I'll edit post where I should.
After reading the first two answers below and thinking about it, I agree that I am approaching the issue from the wrong angle. Instead of going with a "brute force" approach I should have used probabilities and work with that.
My intention is, later on, for each combination: - Count how many values are under a threshold X. - Count how many values are equal or over threshold X and below a threshold Y. - Count how many values are over threshold Y. And group the combinations that have the same counts.
As an illustrative example: If I roll 12 dice of 6 sides, what's the probability of having M dice with a value <3, N dice with a value >=3 and <4, and P dice with a value >5, for all possible combinations of M, N, and P?
So, I think that I'll close this question in a few days while I go with this new approach. Thank you for all the feedback and your time!
The number tuples that list(product(range(1,7), repeats=12))
makes is 6**12, 2,176,782,336. Whether a list or array that's probably too large for most computers.
In [119]: len(list(product(range(1,7),repeat=12)))
....
MemoryError:
Trying to make an array of that size directly:
In [129]: A = np.ones((6**12,12),int)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-129-e833a9e859e0> in <module>()
----> 1 A = np.ones((6**12,12),int)
/usr/local/lib/python3.5/dist-packages/numpy/core/numeric.py in ones(shape, dtype, order)
190
191 """
--> 192 a = empty(shape, dtype, order)
193 multiarray.copyto(a, 1, casting='unsafe')
194 return a
ValueError: Maximum allowed dimension exceeded
Array memory size, 4 bytes per item
In [130]: 4*12*6**12
Out[130]: 104,485,552,128
100GB?
Why do you need to generate 2B combinations of 7 numbers?
So with your Counter you reduce the number of items
In [138]: sp = sorted_product(range(1,7), 2)
In [139]: counter=Counter(sp)
In [140]: counter
Out[140]:
Counter({(1, 1): 1,
(1, 2): 2,
(1, 3): 2,
(1, 4): 2,
(1, 5): 2,
(1, 6): 2,
(2, 2): 1,
(2, 3): 2,
(2, 4): 2,
(2, 5): 2,
(2, 6): 2,
(3, 3): 1,
(3, 4): 2,
(3, 5): 2,
(3, 6): 2,
(4, 4): 1,
(4, 5): 2,
(4, 6): 2,
(5, 5): 1,
(5, 6): 2,
(6, 6): 1})
from 36 to 21 (for 2 repetitions). It shouldn't be hard to generalize this to more repetitions (combinations? permutations?) It still will push time and/or memory boundaries.
A variant on meshgrid
using mgrid
:
In [175]: n=7; A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T
In [176]: A.shape
Out[176]: (279936, 7)
In [177]: B=np.array(list(product(range(1,7),repeat=7)))
In [178]: B.shape
Out[178]: (279936, 7)
In [179]: A[:10]
Out[179]:
array([[1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 2],
[1, 1, 1, 1, 1, 1, 3],
[1, 1, 1, 1, 1, 1, 4],
[1, 1, 1, 1, 1, 1, 5],
[1, 1, 1, 1, 1, 1, 6],
[1, 1, 1, 1, 1, 2, 1],
[1, 1, 1, 1, 1, 2, 2],
[1, 1, 1, 1, 1, 2, 3],
[1, 1, 1, 1, 1, 2, 4]])
In [180]: B[:10]
Out[180]:
array([[1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 2],
[1, 1, 1, 1, 1, 1, 3],
[1, 1, 1, 1, 1, 1, 4],
[1, 1, 1, 1, 1, 1, 5],
[1, 1, 1, 1, 1, 1, 6],
[1, 1, 1, 1, 1, 2, 1],
[1, 1, 1, 1, 1, 2, 2],
[1, 1, 1, 1, 1, 2, 3],
[1, 1, 1, 1, 1, 2, 4]])
In [181]: np.allclose(A,B)
mgrid
is quite a bit faster:
In [182]: timeit B=np.array(list(product(range(1,7),repeat=7)))
317 ms ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [183]: timeit A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T
13.9 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but, yes, it will have the same overall memory usage and limit.
With n=10,
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.