I have a class with a bunch of functions that check data against a huge DataFrame (~33 GB). Each value of a variable is run against one of the DataFrame's columns (let's say column D), and the result is appended to the DataFrame itself so later iterations can use it.
In other words, i is run against df.D, then j is run against df.D plus the result of i, and so on. I am trying to find which set of numbers produces the best output. Below is a snippet of what the code looks like.
program.py
class Test:
    def runTest(self):
        pass

    def run(self):
        self.runTest()
        # bunch of if/else statements to check the data
        # df.to_csv(...) to export the result

    def aa(self, n):
        ...  # calculation

    def bb(self, n):
        ...  # do something

    # ...more functions like these
runTest.py
for i in range(10, 25):
    for j in range(45, 85):
        for k in range(6, 16):
            for l in range(7, 21):
                for m in range(65, 75):
                    class hello(Test):
                        def runTest(self):
                            a = self.aa(i)
                            b = self.bb(j)
                            ...
                    hello().run()
I have tried itertools.product to build the list of all the numbers in these ranges, but I do not know how to plug those values into my program. I would like the solution to be scalable, since the ranges will get much bigger and I will be adding more parameters to test.
How do I run these nested for loops with dask or multiprocessing to minimize the time this task takes? Any other suggestion is greatly appreciated. Also, if there is a better way to export the results, please let me know.
It seems you are doing some sort of grid search/parameter exploration. I would avoid classes and nested loops in this case.
To set up one list of all the parameter combinations, you can use itertools.product, for example:
from itertools import product
for i, j in product(range(10), range(20)):
    # run calculations for this (i, j) pair
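If you want the combinations as a concrete list you can hand to your program, you can materialize the product. A minimal sketch using the ranges from your snippet (param_grid is just an illustrative name); adding another parameter later is just one more range argument:

from itertools import product

# one tuple per (i, j, k, l, m) combination, using the ranges from the question
param_grid = list(product(range(10, 25), range(45, 85),
                          range(6, 16), range(7, 21), range(65, 75)))
print(len(param_grid))  # 15 * 40 * 10 * 14 * 10 = 840,000 combinations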
To evaluate multiple parameter combinations in parallel, I would use dask.delayed:
import dask
import pandas as pd
from itertools import product

@dask.delayed
def try_calc(i, j, k):
    df = pd.read_csv(my_csv_file)
    # run calculations
    # give each combination its own output file (e.g. results_file could be
    # a template like "results_{}_{}_{}.csv"), otherwise parallel tasks
    # would overwrite each other's results
    df.to_csv(results_file.format(i, j, k))
results = dask.compute([
    try_calc(i, j, k) for i, j, k in product(range(10), range(20), range(30))
])
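If you would rather stay in the standard library, multiprocessing gives you the same pattern. Below is a minimal sketch, assuming a hypothetical run_one function that does your per-combination calculation and returns a summary row (the score field and the output filename are placeholders):

from itertools import product
from multiprocessing import Pool

import pandas as pd

def run_one(params):
    i, j, k = params
    # run calculations for this combination (your aa/bb logic goes here)
    score = ...  # placeholder for whatever metric you are optimizing
    return {"i": i, "j": j, "k": k, "score": score}

if __name__ == "__main__":
    grid = product(range(10), range(20), range(30))
    with Pool() as pool:  # defaults to os.cpu_count() worker processes
        rows = pool.map(run_one, grid)
    # collect everything into one DataFrame and export once, rather than
    # writing one CSV per combination
    pd.DataFrame(rows).to_csv("all_results.csv", index=False)

Returning small result rows and writing a single CSV at the end is usually a nicer way to export than one file per task, since it leaves you with one table you can sort by score. One caveat either way: with a ~33 GB source file you do not want every task to re-read the data, so load it once per worker if the reads become the bottleneck.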