
return counter object in multiprocessing / map function


I have a Python script that starts the same function in multiple processes. The function creates and processes two counters (c1 and c2). The results of all the c1 counters from the forked processes should be merged together, and the same goes for all the c2 counters returned by the different forks.

My (pseudo)-code looks like that:

from collections import Counter
import multiprocessing

def countIt(cfg):
    c1 = Counter()
    c2 = Counter()
    # do some things and fill the counters by counting words in a text, e.g.
    # c1 = Counter({'apple': 3, 'banana': 0})
    # c2 = Counter({'blue': 3, 'green': 0})

    return c1, c2

if __name__ == '__main__':
    cP1 = Counter()
    cP2 = Counter()
    cfg = "myConfig"
    p = multiprocessing.Pool(4)  # creating 4 forks
    c1, c2 = p.map(countIt, cfg)[:2]
    # 1.) This only works with the [:2], which seems like a bad idea
    # 2.) at this point c1 and c2 are lists, not Counters anymore,
    #     so the following will not work:
    cP1 + c1
    cP2 + c2

Following the example above, I need a result like:

cP1 = Counter({'apple': 25, 'banana': 247, 'orange': 24})
cP2 = Counter({'red': 11, 'blue': 56, 'green': 3})

So my question: how can I count things inside a forked process in order to aggregate each counter (all the c1s and all the c2s) in the parent process?


Solution

  • You need to "unzip" your result, for example with a for loop. You will receive a list of tuples, where each tuple is a pair of counters: (c1, c2).
    Your current solution actually mixes them up: you assigned [(c1a, c2a), (c1b, c2b)] to c1, c2, meaning that c1 contains (c1a, c2a) and c2 contains (c1b, c2b).
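
    To see what goes wrong in isolation, here is a minimal sketch (with made-up counter values) of what the tuple unpacking in the question actually does, and how a loop fixes it:

    from collections import Counter

    results = [(Counter({'apple': 3}), Counter({'blue': 3})),
               (Counter({'apple': 1}), Counter({'green': 2}))]

    c1, c2 = results[:2]
    # c1 is now the whole first *pair*, not a Counter:
    print(c1)   # (Counter({'apple': 3}), Counter({'blue': 3}))

    # unzipping with a loop keeps the pairs straight:
    cP1, cP2 = Counter(), Counter()
    for c1, c2 in results:
        cP1 += c1
        cP2 += c2
    print(cP1)  # Counter({'apple': 4})
    print(cP2)  # Counter({'blue': 3, 'green': 2})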

    Try this:

    if __name__ == '__main__':
        from contextlib import closing

        cP1 = Counter()
        cP2 = Counter()

        # I hope you have an actual list of configs here, otherwise map
        # will call `countIt` with the single characters of the string 'myConfig'
        cfg = "myConfig"

        # `contextlib.closing` makes sure the pool is closed after we're done.
        # In Python 3, Pool is itself a context manager and you don't need to
        # surround it with `closing` in order to be able to use it in the `with`
        # construct.
        # This approach, however, is compatible with both Python 2 and Python 3.
        with closing(multiprocessing.Pool(4)) as p:
            # Just counting, no need to order the results.
            # This might actually be a bit faster.
            for c1, c2 in p.imap_unordered(countIt, cfg):
                cP1 += c1
                cP2 += c2
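
    For completeness, here is a self-contained sketch of the whole flow, assuming (as the comment above suggests) that you pass an actual list of configs; the body of countIt below is just a hypothetical placeholder that counts words in a text:

    from collections import Counter
    from contextlib import closing
    import multiprocessing

    def countIt(cfg):
        # hypothetical stand-in: count fruit words and colour words in a text
        words = cfg.split()
        c1 = Counter(w for w in words if w in ('apple', 'banana', 'orange'))
        c2 = Counter(w for w in words if w in ('red', 'blue', 'green'))
        return c1, c2

    if __name__ == '__main__':
        cP1 = Counter()
        cP2 = Counter()
        cfgs = ["apple apple blue", "banana green apple"]  # hypothetical configs

        with closing(multiprocessing.Pool(4)) as p:
            for c1, c2 in p.imap_unordered(countIt, cfgs):
                cP1 += c1
                cP2 += c2

        print(cP1)  # Counter({'apple': 3, 'banana': 1})
        print(cP2)  # Counter({'blue': 1, 'green': 1}) -- order may vary

    The aggregation in the parent process is just cP1 += c1 and cP2 += c2 on each pair as it arrives, so the workers never have to share state.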