Search code examples
pythonmultiprocessingdefaultdict

Using defaultdict with multiprocessing?


Just experimenting and learning, and I know how to create a shared dictionary that can be accessed with multiple proceses but I'm not sure how to keep the dict synced. defaultdict, I believe, illustrates the problem I'm having.

from collections import defaultdict
from multiprocessing import Pool, Manager, Process

#test without multiprocessing
s = 'mississippi'
d = defaultdict(int)
for k in s:
    d[k] += 1

print d.items() # Success! result: [('i', 4), ('p', 2), ('s', 4), ('m', 1)]
print '*'*10, ' with multiprocessing ', '*'*10

def test(k, multi_dict):
    multi_dict[k] += 1

if __name__ == '__main__':
    pool = Pool(processes=4)
    mgr = Manager()
    multi_d = mgr.dict()
    for k in s:
        pool.apply_async(test, (k, multi_d))

    # Mark pool as closed -- no more tasks can be added.
    pool.close()

    # Wait for tasks to exit
    pool.join()

    # Output results
    print multi_d.items()  #FAIL

print '*'*10, ' with multiprocessing and process module like on python site example', '*'*10
def test2(k, multi_dict2):
    multi_dict2[k] += 1


if __name__ == '__main__':
    manager = Manager()

    multi_d2 = manager.dict()
    for k in s:
        p = Process(target=test2, args=(k, multi_d2))
    p.start()
    p.join()

    print multi_d2 #FAIL

The first result works(because its not using multiprocessing), but I'm having problems getting it to work with multiprocessing. I'm not sure how to solve it but I think there might be due to it not being synced(and joining the results later) or maybe because within multiprocessing I cannot figure how to set defaultdict(int) to the dictionary.

Any help or suggestions on how to get this to work would be great!


Solution

  • You can subclass BaseManager and register additional types for sharing. You need to provide a suitable proxy type in cases where the default AutoProxy-generated type does not work. For defaultdict, if you only need to access the attributes that are already present in dict, you can use DictProxy.

    from multiprocessing import Pool
    from multiprocessing.managers import BaseManager, DictProxy
    from collections import defaultdict
    
    class MyManager(BaseManager):
        pass
    
    MyManager.register('defaultdict', defaultdict, DictProxy)
    
    def test(k, multi_dict):
        multi_dict[k] += 1
    
    if __name__ == '__main__':
        pool = Pool(processes=4)
        mgr = MyManager()
        mgr.start()
        multi_d = mgr.defaultdict(int)
        for k in 'mississippi':
            pool.apply_async(test, (k, multi_d))
        pool.close()
        pool.join()
        print multi_d.items()