python multiprocessing pool

Passing a list of lists to multiprocessing.Pool doesn't seem to work


from multiprocessing import Pool

data_table = None
def init_data_table(my_data_table = [], *args):
    global data_table
    data_table = my_data_table

def process_data(index):
    # create data processor object and run cpu intensive task here
    return str(index) + " " +  data_table[index][0]

def main():
    # call db functions once and get data table from db
    data_table = ...
    pool = Pool(processes = 4, initializer=init_data_table, initargs=(data_table))
    x = pool.map(process_data, range(10))

The problem is that when I pass data_table and try to access it later on, it does not work. I get this error:

IndexError: list index out of range

I'm not sure if this is the correct way of passing a complex data structure such as a tuple or list of lists into the Pool() function, so it can be accessed by the forked child processes. Essentially it is a shared piece of data that I want to retrieve once only, as it is an expensive call to the db, and I want to make it accessible to the processes.

Any assistance would be greatly appreciated, thanks.


Solution

  • The documentation for multiprocessing.Pool says this about initializer:

    If initializer is not None then each worker process will call initializer(*initargs) when it starts.

    So in your case, it's calling init_data_table(*data_table), because (data_table) without a trailing comma is just data_table, not a tuple. Because of the *, it's going to unpack your list of lists, taking each sublist and binding it to a parameter of init_data_table. You've defined it as

    def init_data_table(my_data_table=[], *args):
    

    So, when Python unpacks this, the first sublist ends up in my_data_table and all of the rest end up in a tuple bound to args. The global data_table in each worker is then only the first row, which is why indexing it with larger values of index raises IndexError. To avoid this, you need to put your data_table into a tuple when you assign it to initargs. It actually looks like you tried to do this, but you forgot to include the trailing comma:

    pool = Pool(processes = 4, initializer=init_data_table, initargs=(data_table,))
    

    Then, Python ends up calling init_data_table(*(data_table,)), which passes your whole data_table list as my_data_table and leaves args empty, which is what you really want to happen.
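
The difference between the two calls can be checked without starting a pool at all. The sketch below is only an illustration: sample_table is a hypothetical stand-in for the table fetched from the database, and this init_data_table returns its arguments instead of setting a global so the result can be inspected.

```python
def init_data_table(my_data_table=None, *args):
    """Stand-in for the initializer: report what each parameter received."""
    return my_data_table, args


# Hypothetical stand-in for the table fetched from the database.
sample_table = [["alice", 30], ["bob", 25], ["carol", 41]]

# Parentheses alone do not create a tuple, so this is the same object.
print((sample_table) is sample_table)  # True

# Without the trailing comma, the rows are spread across the parameters.
first, rest = init_data_table(*(sample_table))
print(first)  # ['alice', 30]
print(rest)   # (['bob', 25], ['carol', 41])

# With the trailing comma, the whole table arrives as a single argument.
whole, extra = init_data_table(*(sample_table,))
print(whole is sample_table)  # True
print(extra)                  # ()

# The worker's "table" is really just the first row, so an index past
# the end of that row fails exactly like the error in the question.
try:
    first[2][0]
except IndexError as exc:
    print("IndexError:", exc)  # IndexError: list index out of range
```

Passing initargs=[data_table] would most likely work as well, since the Pool only unpacks initargs with *; the one-element tuple is simply the conventional spelling.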