Tags: python, algorithm, cartesian-product

Batch processing with itertools in Python


I'm trying to normalize a JSON file using python.

Basically, I iterate through this JSON, appending its values to different lists. Then I combine all of these lists using itertools.product.

When the JSON has a large number of lines and lists, my code fails with an out-of-memory (OOM) error.

NOTE: The JSON file is 650 KB.

My method to get the product of all lists is:

final_list = []
# Eliminate empty lists
iterate_lists = [lst for lst in lists if lst]
if iterate_lists:
    for combinacao in product(*iterate_lists):
        dict_final = {}
        for item in combinacao:
            dict_final.update(dict_base)
            dict_final.update(item)
        final_list.append(dict_final)  # dict_final is fresh each iteration, no copy needed
return final_list

iterate_lists has at most 20 lists, and each of those lists has at most 70 values. Each list stores dicts that will be combined via the Cartesian product.
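To make the blow-up concrete, here is a miniature, hypothetical version of the structure (4 lists of 3 dicts instead of 20 lists of 70): the number of combinations is the product of the list lengths, so it grows exponentially with the number of lists.

```python
from itertools import product

# Hypothetical miniature payload: 4 lists of 3 single-key dicts each.
lists = [[{f"key{i}": v} for v in range(3)] for i in range(4)]

combos = list(product(*lists))
print(len(combos))  # 3 ** 4 = 81 combinations
```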

I tried scaling my cluster up to 128 GB of RAM, but the OOM error still happens. I also tried using fewer threads, but the OOM still happens.

An example of a payload can be found here.

As I said before, my code works fine for that example, but when there is a large payload and the lists inside it have many values, my code raises an OOM error.

So my question is:

What is the best way to handle this problem since I have a large number of lists?


Solution

  • You could try using islice to limit the number of products per batch. Since itertools.product is a lazy iterator, it produces combinations one at a time instead of building them all in memory.

    Try:

    from itertools import product, islice
    
    ...
    
    final_list = []
    # Eliminate empty lists
    iterate_lists = [lst for lst in lists if lst]
    if iterate_lists:
        # islice caps the total number of combinations materialized;
        # it yields individual combinations, so a single loop is enough
        for combinacao in islice(product(*iterate_lists), 1_000_000):  # cap of your choice
            dict_final = {}
            for item in combinacao:
                dict_final.update(dict_base)
                dict_final.update(item)
            final_list.append(dict_final)

    return final_list
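
    Note that a single islice only takes the *first* N combinations and drops the rest. If the goal is to process the whole product in fixed-size batches, islice can be called repeatedly on the same iterator until it is exhausted. A sketch (batched_product and the sample lists are hypothetical names):

```python
from itertools import product, islice

def batched_product(lists, batch_size):
    """Yield the Cartesian product of *lists* in fixed-size batches."""
    it = product(*lists)  # one shared lazy iterator
    while True:
        batch = list(islice(it, batch_size))  # take the next batch_size items
        if not batch:
            break  # iterator exhausted
        yield batch

# Small hypothetical example: 2 * 2 * 2 = 8 combinations, batches of 3.
lists = [[1, 2], [3, 4], [5, 6]]
batches = list(batched_product(lists, 3))
print([len(b) for b in batches])  # [3, 3, 2]
```

    Each batch can be processed and discarded before the next one is pulled, so peak memory is bounded by the batch size rather than the full product.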
    

    Also, as @TimRoberts mentioned, no computer can handle the full Cartesian product of 20 lists of 70 items each: that would be 70**20 = 7,979,226,629,761,200,100,000,000,000,000,000,000 items, about 8 × 10^36.
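
    The arithmetic is easy to check directly:

```python
# Full Cartesian product of 20 lists with 70 items each:
n = 70 ** 20
print(n)           # 7979226629761200100000000000000000000
print(f"{n:.2e}")  # ~7.98e+36 -- far beyond anything that fits in memory
```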

    So maybe you're actually looking for permutations, combinations, or combinations_with_replacement instead of the full product.
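
    If order-insensitive or non-repeating selections are what's really needed, the counts are dramatically smaller. A quick comparison, assuming a single pool of 70 items from which 20 are picked:

```python
from math import comb, perm

pool, k = 70, 20
print(70 ** 20)      # Cartesian product of 20 independent lists of 70 items
print(perm(pool, k)) # ordered selections without repetition (permutations)
print(comb(pool, k)) # unordered selections (combinations) -- smallest of the three
```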