Tags: python, multiprocessing, overhead-minimization

How to reduce time for multiprocessing in Python


I am trying to use multiprocessing in Python to reduce computation time, but it seems that after adding multiprocessing the overall computation got significantly slower. I created 4 different processes and split the DataFrame into 4 smaller DataFrames, one as input to each process. After timing each process, the overhead cost seems significant, and I was wondering if there is a way to reduce these overhead costs.

I am using Windows 7 and Python 3.5, and my machine has 8 cores.

import datetime as dt
import multiprocessing
from functools import partial

import numpy as np
import pandas as pd


def doSomething(dataPassed, args):
    # process the data and calculate the outputs
    ...

def parallelize_dataframe(df, nestedApply):
    startTime = dt.datetime.now()
    df_split = np.array_split(df, 4)
    pool = multiprocessing.Pool(4)
    results = pool.map(nestedApply, df_split)
    elapsed = (dt.datetime.now() - startTime).total_seconds()
    print('finished with Simulation in', elapsed, 'seconds')

    pool.close()
    pool.join()
    return results

def nestedApply(df):
    # partial() fixes args so df.apply() only has to pass each row
    func2 = partial(doSomething, args=())
    res = df.apply(func2, axis=1)
    # res = [output Tables]
    return res

if __name__ == '__main__':
    # query and conn are defined elsewhere
    data = pd.read_sql_query(query, conn)
    parallelize_dataframe(data, nestedApply)

Solution

  • I would suggest using queues instead of providing your DataFrame as chunks. Copying each chunk takes a lot of resources and quite some time, and you could run out of memory if your DataFrame is really big. With queues you also benefit from pandas' fast iterators. Here is my approach; the overhead shrinks as the workers get more complex. Unfortunately, my workers here are far too simple to really show that, but sleep simulates complexity a bit.

    import pandas as pd
    import multiprocessing as mp
    import numpy as np
    import time
    
    
    def worker(in_queue, out_queue):
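        # consume rows from in_queue until the 'STOP' sentinel arrives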
        for row in iter(in_queue.get, 'STOP'):
            value = (row[1] * row[2] / row[3]) + row[4]
            time.sleep(0.1)
            out_queue.put((row[0], value))
    
    if __name__ == "__main__":
        # fill a DataFrame with random data
        df = pd.DataFrame(np.random.randn(int(1e5), 4),
                          columns=list('ABCD'))
    
        in_queue = mp.Queue()
        out_queue = mp.Queue()
    
        # setup workers
        numProc = 2
        process = [mp.Process(target=worker,
                              args=(in_queue, out_queue)) for x in range(numProc)]
    
        # run processes
        for p in process:
            p.start()
    
        # iterator over rows
        it = df.itertuples()
    
        # fill the in_queue and collect results:
        # keep feeding rows until a result shows up in out_queue;
        # the workers drain in_queue concurrently
        for i in range(len(df)):
            while out_queue.empty():
                # fill the queue
                try:
                    row = next(it)
                    in_queue.put((row[0], row[1], row[2], row[3], row[4]), block=True)  # row = (index, A, B, C, D) tuple
                except StopIteration:
                    break
            row_data = out_queue.get()
            df.loc[row_data[0], "Result"] = row_data[1]
    
        # signal the processes to stop
        for p in process:
            in_queue.put('STOP')
    
        # wait for processes to finish
        for p in process:
            p.join()
    

    With numProc = 2 it takes about 50 seconds per loop; with numProc = 4 it is twice as fast.
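
    If each row's payload is small, you can also let a Pool batch the rows for you instead of managing queues by hand. Below is a minimal sketch of that idea; the compute function, the pool size of 4, and the chunksize of 256 are illustrative choices, not part of the original answer. imap with a chunksize pickles rows in batches, which amortizes the per-item IPC cost.

    import multiprocessing as mp
    import time

    import numpy as np
    import pandas as pd


    def compute(row):
        # same toy workload as above; row is a plain (index, A, B, C, D) tuple
        value = (row[1] * row[2] / row[3]) + row[4]
        time.sleep(0.1)
        return row[0], value

    if __name__ == "__main__":
        # smaller frame than above so the demo finishes quickly
        df = pd.DataFrame(np.random.randn(int(1e3), 4),
                          columns=list('ABCD'))

        with mp.Pool(4) as pool:
            # name=None yields plain tuples, which pickle cheaply;
            # chunksize batches rows per pickle round-trip
            results = pool.imap(compute, df.itertuples(name=None),
                                chunksize=256)
            for idx, value in results:
                df.loc[idx, "Result"] = value

    Larger chunksize values mean fewer pickle round-trips but results arrive in bigger bursts, so tune it to the cost of a single row.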