I'm trying to normalize a JSON file using Python. Basically, I iterate through the JSON, appending its values to different lists. Then I take the Cartesian product of all these lists using `itertools.product`.

When the JSON has a large number of lines and lists, my code stops with an OOM error.

NOTE: the JSON file is 650 KB.
My method to get the product of all lists is:

```python
from itertools import product

final_list = []
# Eliminate empty lists
iterate_lists = [lst for lst in lists if lst]
if iterate_lists:
    for combinacao in product(*iterate_lists):
        dict_final = {}
        for item in combinacao:
            dict_final.update(dict_base)
            dict_final.update(item)
        final_list.append(dict_final.copy())
return final_list
```
`iterate_lists` has at most 20 lists, and each of those lists has at most 70 values. Each list stores dicts that will be combined via the Cartesian product.
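For a sense of scale, the worst case can be sanity-checked directly from the maxima above:

```python
# Worst case: 20 lists of 70 dicts each.
n_lists, n_values = 20, 70
total_combinations = n_values ** n_lists
print(total_combinations)  # a 37-digit number -- far too many items to materialize
```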
I tried scaling my cluster up to 128 GB of RAM, but the OOM still happens. I also tried using fewer threads, but the OOM still happens.
An example of a payload can be found here.
Like I said before, my code works fine for that example, but when there is a large payload and the lists inside it have a lot of values, my code gives an OOM error.
So my question is:
What is the best way to handle this problem since I have a large number of lists?
You could try using `islice` to limit the number of products you take. Since `itertools.product` is lazy, it produces combinations one at a time instead of materializing them all.
Try:

```python
from itertools import product, islice

...

final_list = []
# Eliminate empty lists
iterate_lists = [lst for lst in lists if lst]
if iterate_lists:
    # islice yields individual combinations; cap them at a value of your
    # choice, presumably 1,000,000
    for combinacao in islice(product(*iterate_lists), 1000000):
        dict_final = {}
        for item in combinacao:
            dict_final.update(dict_base)
            dict_final.update(item)
        final_list.append(dict_final.copy())
return final_list
```
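If you need to process every combination rather than cap the total, a variation on the same idea is to pull fixed-size batches from the lazy iterator and dispose of each batch before pulling the next. A sketch (the batch size and the generator name are my own; the consumer decides what to do with each batch, e.g. write it to disk):

```python
from itertools import islice, product


def batched_product(lists, dict_base, batch_size=100_000):
    """Yield batches of merged dicts instead of building one huge list.

    Peak memory is bounded by batch_size, since each batch is
    materialized, yielded, and then discarded.
    """
    iterate_lists = [lst for lst in lists if lst]  # eliminate empty lists
    if not iterate_lists:
        return
    it = product(*iterate_lists)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        merged = []
        for combinacao in batch:
            dict_final = dict(dict_base)
            for item in combinacao:
                dict_final.update(item)
            merged.append(dict_final)
        yield merged
```

Usage would be something like `for batch in batched_product(lists, dict_base): write_to_disk(batch)`, so no more than one batch of results is ever alive at once.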
Also, as @TimRoberts mentioned, there is no way a computer can handle the Cartesian product of 20 lists of 70 items each: that would be a total of 70^20 = 7,979,226,629,761,200,100,000,000,000,000,000,000 items.

So maybe you're looking for either `permutations`, `combinations`, or `combinations_with_replacement` instead.
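For intuition on how much smaller those alternatives are, you can compare the counts for a single pool of 70 items taken 20 at a time (the order-of-magnitude comments are my own rough estimates):

```python
from math import comb, perm

# Full Cartesian product of 20 lists of 70 items each: ~8e36
print(70 ** 20)
# Ordered selections of 20 distinct items from one 70-item pool: smaller
print(perm(70, 20))
# Unordered selections: smaller still, on the order of 1e17
print(comb(70, 20))
```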