
How do I make Python respect iterable fields when multiprocessing?


Apologies if this is a dumb question, but I've not found an elegant workaround for this issue yet. Basically, when using the concurrent.futures module, non-static methods of classes look like they should work fine: I didn't see anything in the docs for the module implying they wouldn't, the module produces no errors when running, and it even produces the expected results in many cases!

However, I've noticed that the module seems to not respect updates to iterable fields made in the parent thread, even when those updates occur before starting any child processes. Here's an example of what I mean:

import concurrent.futures


class Thing:
    data_list = [0, 0, 0]
    data_number = 0

    def foo(self, num):
        return sum(self.data_list) * num

    def bar(self, num):
        return num * self.data_number


if __name__ == '__main__':
    thing = Thing()
    thing.data_list[0] = 1
    thing.data_number = 1

    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(thing.foo, range(3))
        print('result of changing list:')
        for result in results:
            print(result)

        results = executor.map(thing.bar, range(3))
        print('result of changing number:')
        for result in results:
            print(result)

I would expect the result here to be

result of changing list:
0
1
2
result of changing number:
0
1
2

but instead I get

result of changing list:
0
0
0
result of changing number:
0
1
2

So for some reason, things work as expected for the field that's just an integer, but not at all as expected for the field that's a list. The implication is that the updates made to the list are not respected when the child processes are called, even though the updates to the simpler fields are. I've tried this with dicts as well with the same issue, and I suspect that this is a problem for all iterables.

Is there any way to make this work as expected, allowing for updates to iterable fields to be respected by child processes? It seems bizarre that multiprocessing for non-static methods would be half-implemented like this, but I'm hoping that I'm just missing something!


Solution

  • The problem has nothing to do with "respecting iterable fields"; it is a rather subtle issue. In your main process you have:

    thing.data_list[0] = 1 # first assignment
    thing.data_number = 1 # second assignment
    

    Rather than:

    Thing.data_list[0] = 1 # first assignment
    Thing.data_number = 1 # second assignment
    

    As far as the first assignment is concerned, there isn't any material difference because with either version you are not modifying a class attribute but rather an element within a list that happens to be referenced by a class attribute. In other words, Thing.data_list is still pointing to the same list; this reference has not been changed. This is an important distinction.

    But the second assignment in your version of the code assigns through the instance reference, thing. Rather than modifying the class attribute, that creates a new instance attribute with the same name, data_number, which shadows the class attribute.

    Your methods foo and bar are attempting to access class attributes via self. The Thing instance thing is pickled and sent across to the worker's address space, but when it is un-pickled there, the class attributes are re-created with their default values (the child process re-imports the module, re-executing the class body) unless you add special pickle rules. Instance attributes, however, are transmitted successfully, including your newly created data_number. And that's why the 'result of changing number:' output is what you expected: in bar you are actually accessing the instance attribute data_number.

    Change bar to the following and you will see that everything will print out as 0:

        def bar(self, num):
            return num * Thing.data_number