python, python-3.x, pandas, optimization, python-itertools

Memory optimization for generating data larger than RAM


Let's say I want to generate the Cartesian product of a range, i.e.:

from itertools import product
var_range = range(-10000, 10000)
vars = list(product(var_range, repeat=2))
vars[:10]

So the output is like:

[(-10000, -10000),
 (-10000, -9999),
 (-10000, -9998),
 (-10000, -9997),
 (-10000, -9996),
 (-10000, -9995),
 (-10000, -9994),
 (-10000, -9993),
 (-10000, -9992),
 (-10000, -9991)]

However, this seems to be too much for my RAM, and my IPython session (12 GB RAM) crashes.

I was thinking of splitting the range into batches and using them in four loop iterations:

[-10000,-5000],[-4999,0],[1,5000],[5001,10000]

Then, after each iteration, I could save the result as a pandas DataFrame to an HDF5 file and append it to the outcome of the previous iterations.
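
For illustration, a minimal sketch of that batch-and-append plan (assuming PyTables is installed for HDF5 support; the file name pairs.h5, the key, and the column names x/y are placeholders, and numpy repeat/tile stands in for itertools.product so each batch stays a compact integer array):

import numpy as np
import pandas as pd

# Placeholder output file and HDF5 key.
path, key = "pairs.h5", "pairs"

inner = np.arange(-10000, 10000, dtype=np.int64)
# The four outer-range batches listed above (end points inclusive).
batches = [range(-10000, -4999), range(-4999, 1), range(1, 5001), range(5001, 10001)]

for batch in batches:
    outer = np.fromiter(batch, dtype=np.int64)
    # Build this batch's pairs as two flat columns, in the same order
    # as product(batch, inner) would produce them.
    df = pd.DataFrame({"x": np.repeat(outer, len(inner)),
                       "y": np.tile(inner, len(outer))})
    # format="table" lets each iteration append to the same key.
    df.to_hdf(path, key=key, format="table", append=True)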

I have also read about generators in Python.

  • If so, how exactly could generators bring optimisation in this case?
  • What would be the most Pythonic way to optimise such a simple case?

Solution

  • Maybe this would work:

    from itertools import product
    var_range = range(-10000, 10000)
    vars = product(var_range, repeat=2)
    print([next(vars) for _ in range(10)])
    

    Converting such a long sequence to a list takes a long time and a lot of memory. Instead, you can consume only the part you actually need, here the first ten elements. list(...) materialises the whole object, while calling next ten times does not.
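
  • If you need to work through the whole product rather than just a preview, you can stream it in fixed-size chunks instead of materialising it. A minimal sketch (the chunks helper and the chunk size of one million are just illustrative choices):

    from itertools import islice, product

    def chunks(iterable, size):
        # Yield successive lists of at most `size` items from `iterable`.
        it = iter(iterable)
        while True:
            chunk = list(islice(it, size))
            if not chunk:
                return
            yield chunk

    pairs = product(range(-10000, 10000), repeat=2)
    for chunk in chunks(pairs, 1_000_000):
        # Only one chunk (one million pairs) is held in memory at a time,
        # so peak memory stays small no matter how large the full product is.
        ...  # process or write the chunk here

    This way each loop iteration sees an ordinary list, but the 400 million pairs are never all in memory at once.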