I am very new to Python. I need to simulate a simple water balance in rain tanks using the following function:
def rain_tank_model(rain, water_demand, roof_area, tank_size, household_name):
    # rain and water_demand time series are numpy arrays with more than 8 million recordings
    # each household has installed a rain tank with a specific size
    v = []             # water volume in the tank
    spill = []         # amount of water that spills from the rain tank
    unmet_demand = []  # amount of unmet water demand
    volume = 0.0       # stored volume at the start of the simulation
    for i in range(len(rain)):
        volume += rain[i] * roof_area - water_demand[i]
        if volume < 0.:  # volume cannot be negative
            unmet_demand.append(-volume)
            volume = 0.
            v.append(volume)
            spill.append(0.)
        elif volume > tank_size:  # water spills from the tank
            spill.append(volume - tank_size)
            volume = tank_size
            v.append(volume)
            unmet_demand.append(0.)
        else:
            spill.append(0.)
            v.append(volume)
            unmet_demand.append(0.)
    with open(str(household_name) + ".txt", "w") as f:
        for i in range(len(v)):
            f.write(str(v[i]) + "\t" + str(spill[i]) + "\t" + str(unmet_demand[i]) + "\n")
I need to run this function for 50,000 houses, each with a specific rain tank size, roof area, and water demand time series. I can do this by putting the function in a loop and iterating over the houses. Since each simulation is completely independent (they only need access to the same input rain array), I was thinking I could use multi-threading or multi-processing in Python to speed up the simulation. I read about the differences between them but couldn't figure out which one I should use.
I tried multi-processing (a pool and the map function) to parallelize a simplified version of the function that only takes the rain numpy array as input (assuming the tank size and roof area are the same for every house and the water demand is constant). The reason for simplifying is that I couldn't understand how to pass multiple arguments. I had 20 houses to simulate. The looping method was significantly faster than the multi-processing method. I tried different pool sizes, from 2 to 20. I also tried to share the rain data between the processes using a Manager but wasn't successful. The references I read were very advanced and difficult to understand. I would appreciate a hint on how to parallelize the function, or a pointer to similar examples.
The short answer is:
If your function is CPU-bound, use multiprocessing; if it is IO-bound, use multithreading.
A bit longer answer:
Python has a feature called the GIL (Global Interpreter Lock). This lock imposes a major restriction: only one thread can execute Python bytecode at any moment. So if you have a lot of calculations, multithreading will look like parallel execution, but in fact only one thread is active at any given time.
As a result, multithreading is good for IO-bound operations such as downloading data: you can start a download in one thread and do other work in another, instead of waiting for the download to finish.
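To illustrate the IO-bound case, here is a minimal sketch where four waits (stand-ins for downloads; `fake_download` is a made-up name) overlap in a thread pool, so the total wall time is roughly one wait, not four:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(name):
    # stands in for a network or disk wait; the GIL is released while sleeping
    time.sleep(0.5)
    return name

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(fake_download, ["a", "b", "c", "d"]))
elapsed = time.perf_counter() - start
print(results, round(elapsed, 2))  # elapsed is roughly 0.5 s, not 2 s
```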
So, if you want to perform parallel calculations, it is better to use multiprocessing. But keep in mind that each process has its own memory space (with multithreading, memory is shared between threads).
UPD
There are ways to share memory between processes; you can find more information here: https://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes.