python python-3.x multiprocessing python-multiprocessing

multiprocessing Pool.map: each worker runs the code outside the __main__ block


import multiprocessing
import threading

counter = 1
print("Code outside __main__",counter)
lock = threading.Lock()
counter += 1

def foo(i):
  #print("Inside foo ",i)
  pass

if __name__ == '__main__':
  pool = multiprocessing.Pool(processes=10)
  pool.map(foo, range(100))

If you run this code from the terminal with python run.py, it prints:

Code outside __main__ 1
Code outside __main__ 1
Code outside __main__ 1
Code outside __main__ 1
Code outside __main__ 1
Code outside __main__ 1
Code outside __main__ 1
Code outside __main__ 1
Code outside __main__ 1

If you uncomment the print in foo(), the Code outside __main__ 1 line sometimes appears in between the output of the foo() calls.

Why is it doing that?


import multiprocessing
import threading

counter = 1
print("Code outside __main__",counter)
counter += 1

def foo(i):
  global lock
  with lock:
    print("Inside foo ",i)

if __name__ == '__main__':
  lock = threading.Lock()
  pool = multiprocessing.Pool(processes=10)
  pool.map(foo, range(100))

If I declare the lock inside the __main__ block, it's undefined inside foo(), even if I use global lock.

Here's the output:

Code outside __main__ 1
Code outside __main__ 1
Code outside __main__ 1
Code outside __main__ 1
Code outside __main__ 1
Code outside __main__ 1
Code outside __main__ 1
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\David\AppData\Local\Programs\Python\Python38-32\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\David\AppData\Local\Programs\Python\Python38-32\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "C:\Users\David\Documents\test.py", line 10, in foo
    with lock:
NameError: name 'lock' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test.py", line 16, in <module>
    pool.map(foo, range(100))
  File "C:\Users\David\AppData\Local\Programs\Python\Python38-32\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\David\AppData\Local\Programs\Python\Python38-32\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
NameError: name 'lock' is not defined
Code outside __main__ 1
Code outside __main__ 1

This code is just a simplification. I'm trying to scrape a website and write to a file, but I want to understand what's happening here.


Solution

    Part 1

    sometimes the Code outside __main__ 1 is in between the foo() calls.

    Each of the 10 child processes you create with multiprocessing.Pool imports the "main" file, which means executing it from the top. This happens under the spawn start method, which is the default on Windows (the paths in your traceback show that is what you are running). You get the print once for the main process and then once for each child. With more children, some of the early birds may start processing inputs from the pool.map call before others have finished initializing, which is why the prints can be interleaved. During this import each process also gets its own copy of the counter variable, so the printed value is always 1.
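
    To see this for yourself, here is a minimal sketch that tags the module-level print with the name and PID of the process running it. The helper name describe_process is made up for illustration. It assumes the spawn start method; under fork, the long-standing default on Linux, children copy the parent's memory instead of re-importing the file, so the module-level line would print only once.

```python
import multiprocessing as mp
import os


def describe_process():
    # Hypothetical helper: report which process is executing this line.
    return f"{mp.current_process().name} (pid {os.getpid()})"


# Module-level code: under spawn, this runs once in the parent and once more
# in every worker, because each worker re-imports this file.
counter = 1
print("Code outside __main__", counter, "in", describe_process())
counter += 1  # only changes the copy owned by the process running this line


def foo(i):
    return i * 2


if __name__ == "__main__":
    # Only the parent gets here; workers import the file under another name.
    with mp.Pool(processes=4) as pool:
        results = pool.map(foo, range(8))
    print("Parent counter is", counter, "and results are", results)
```

    Anything that should happen exactly once (opening an output file, printing a banner) belongs inside the if __name__ == "__main__" block for this reason.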

    Part 2

    If I declare the lock inside the __main__ block, it's undefined inside foo() even if I use global lock

    foo is executed in a totally separate memory space. global can't send objects to the memory of another process, and lock won't exist in the children because they don't execute anything inside the if block (and they shouldn't). The children need to receive the lock as an argument and assign it in their own memory space. When using a Pool, certain things like locks and queues can only be passed as arguments to the initializer function, not through map (normal Processes don't have as many restrictions). You can then use the initializer to receive the lock and save it in the global namespace of each child.

    import multiprocessing as mp
    from time import sleep
    # mp.Lock has the same interface as threading.Lock but works across processes

    print("Code outside __main__")

    def foo(i):
        # my_lock is set in each worker by init_worker below
        with my_lock:
            sleep(1)  # output appearing one line per second shows the lock is serializing access
            print("code inside foo ", i)

    def init_worker(l):
        # runs once in each worker process and stores the lock as a global there
        global my_lock
        my_lock = l

    if __name__ == '__main__':
        l = mp.Lock()
        with mp.Pool(processes=10, initializer=init_worker, initargs=(l,)) as pool:
            pool.map(foo, range(10))
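
    For the use case mentioned in the question (scraping pages and writing to one file), a lock may not be needed at all: the workers can return their results and the parent can do the writing. The following is only a sketch under that assumption. fetch, write_results, and the results.txt path are placeholder names, and fetch merely fakes the download.

```python
import multiprocessing as mp


def fetch(i):
    # Placeholder for the real scraping work; returns one line of output.
    return f"page {i} scraped"


def write_results(lines, path):
    # Runs in the parent only, so no cross-process lock is needed.
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        # imap yields results in input order as they become available.
        lines = pool.imap(fetch, range(10))
        # Consume the iterator before the pool is torn down.
        write_results(lines, "results.txt")
```

    Keeping all file access in one process avoids interleaved writes without any shared state. The initializer approach above is still the one to use when the workers themselves must touch a shared resource.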