Tags: python, pandas, multiprocessing, nested-loops, dask

How to use multiprocessing for multiple nested for loops in Python?


I have a class with a bunch of functions that check data against a huge dataframe (~33 GB). Each value of a variable is run against one column of the dataframe (let's say column D), and the result is appended to the dataframe so that later iterations can use it to compute their values.

In other words, i is run against df.D, j is run against df.D and the result of i, and so on. I am trying to find which set of numbers gives the best output. Below is a snippet of what the code looks like.

program.py
class Test:
    def runTest(self):
        pass

    def run(self):
        self.runTest()
        # bunch of if/else statements to check the data
        # pd.to_csv to export the result

    def aa(self, value):
        # calculation..
        pass

    def bb(self, value):
        # do something
        pass

     ...

runTest.py
for i in range(10,25):
    for j in range(45,85):
        for k in range(6,16):
            for l in range(7,21):
                for m in range(65,75):
                    class hello(Test):
                        def runTest(self):
                            a = self.aa(i)
                            b = self.bb(j)
                            ...
                    
                    hello().run()

I have tried itertools.product to build a list of all combinations of numbers from the ranges, but I do not know how to plug those values into my program. I would like the approach to be scalable, as the ranges will get much bigger and I will be adding more parameters to test.

How do I run these nested for loops with dask or multiprocessing to minimize the run time? Any other suggestions are greatly appreciated. Also, if there is a better way to export the results, please let me know.


Solution

  • It seems you are doing some sort of grid search/parameter exploration. I would avoid classes and nested loops in this case.

    To set up a single iterable of all parameter combinations, you can use itertools.product, for example:

    from itertools import product
    
    for i, j in product(range(10), range(20)):
        # run calculations
        pass
    

    To evaluate multiple parameter combinations in parallel, I would use dask.delayed:

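    import dask
    import pandas as pd
    from itertools import product
    
    @dask.delayed
    def try_calc(i, j, k):
        # my_csv_file is a placeholder for the path to the input data
        df = pd.read_csv(my_csv_file)
        # run calculations
        # one output file per parameter set, so parallel tasks
        # do not overwrite each other's results
        df.to_csv(f"results_{i}_{j}_{k}.csv")
    
    results = dask.compute([
        try_calc(i, j, k) for i, j, k in product(range(10), range(20), range(30))
    ])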
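
Since the question also asks about multiprocessing, here is a minimal standard-library sketch of the same idea. It uses the five ranges from the question, turns each combination from itertools.product into a dict of named parameters, and collects one result row per combination so that everything is exported to a single CSV at the end. The names PARAM_RANGES, iter_param_sets, evaluate, write_rows and run_grid are hypothetical, and the score in evaluate is only a placeholder for the real checks against the dataframe.

```python
import csv
from itertools import product
from multiprocessing import Pool

# Ranges copied from the nested loops in the question.
PARAM_RANGES = {
    "i": range(10, 25),
    "j": range(45, 85),
    "k": range(6, 16),
    "l": range(7, 21),
    "m": range(65, 75),
}


def iter_param_sets(ranges):
    """Yield one dict per combination, e.g. {"i": 10, "j": 45, ...}."""
    names = list(ranges)
    for values in product(*ranges.values()):
        yield dict(zip(names, values))


def evaluate(params):
    """Hypothetical stand-in for the real check; returns one result row."""
    # Placeholder score, NOT the asker's calculation.
    score = sum(params.values())
    return {**params, "score": score}


def write_rows(rows, path, fieldnames):
    """Write result rows to a single CSV file and return the row count."""
    count = 0
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


def run_grid(ranges, path, processes=None, chunksize=256):
    """Evaluate every combination in worker processes, then export once."""
    fieldnames = list(ranges) + ["score"]
    with Pool(processes) as pool:
        # chunksize batches tasks to reduce inter-process overhead.
        rows = pool.imap_unordered(
            evaluate, iter_param_sets(ranges), chunksize=chunksize
        )
        # Rows are consumed here, while the pool is still open.
        return write_rows(rows, path, fieldnames)
```

To run it, call something like run_grid(PARAM_RANGES, "grid_results.csv") from under an if __name__ == "__main__": guard, which multiprocessing needs on platforms that start workers by spawning a fresh interpreter. Adding a parameter then only means adding an entry to PARAM_RANGES. With a ~33 GB dataframe, reloading the data for every combination would likely dominate the run time; one option is to load it once per worker through the initializer argument of Pool, though whether it fits in memory once per process is something to check first.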