Possible Duplicate: Multiprocessing launching too many instances of Python VM
I'm trying to use Python's multiprocessing module to parallelize web fetching, but I'm finding that the application calling multiprocessing gets instantiated multiple times, not just the function I want called. This is a problem for me because the caller depends on a library that is slow to instantiate, so I lose most of the performance gains from parallelism.
What am I doing wrong, or how can this be avoided?
my_app.py:

    from url_fetcher import url_fetch, parallel_fetch
    import my_slow_stuff

    if __name__ == '__main__':
        import datetime
        urls = ['http://www.microsoft.com'] * 20
        results = parallel_fetch(urls, fn=url_fetch)
        print([x[:20] for x in results])

my_slow_stuff.py:

    import time

    class MySlowStuff(object):
        print('doing slow stuff')
        time.sleep(0)
        print('done slow stuff')
url_fetcher.py:

    import multiprocessing
    import urllib

    def url_fetch(url):
        # return urllib.urlopen(url).read()
        return url

    def parallel_fetch(urls, fn):
        PROCESSES = 10
        CHUNK_SIZE = 1
        pool = multiprocessing.Pool(PROCESSES)
        results = pool.imap(fn, urls, CHUNK_SIZE)
        return results

    if __name__ == '__main__':
        import datetime
        urls = ['http://www.microsoft.com'] * 20
        results = parallel_fetch(urls, fn=url_fetch)
        print([x[:20] for x in results])
partial output:

    $ python my_app.py
    doing slow stuff
    done slow stuff
    doing slow stuff
    done slow stuff
    doing slow stuff
    done slow stuff
    doing slow stuff
    done slow stuff
    doing slow stuff
    done slow stuff
    ...
The Python multiprocessing module behaves slightly differently on Windows because Python doesn't implement os.fork() on that platform. In particular, the documentation's programming guidelines state:

Safe importing of main module
Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).
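The docs illustrate this guideline with roughly the following pattern (a paraphrased sketch of their example, not code from the question):

    from multiprocessing import Process, freeze_support

    def foo():
        print('hello')

    if __name__ == '__main__':
        # freeze_support() matters only for frozen Windows executables;
        # the __main__ guard is what stops a spawned child process from
        # re-running this code when it re-imports the main module.
        freeze_support()
        Process(target=foo).start()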
Here, the global class MySlowStuff always gets evaluated by the newly started child processes on Windows, because its body (including the prints and the sleep) runs at import time. To fix that, class MySlowStuff should be defined only when __name__ == '__main__'. See section 16.6.3.2. Windows of the multiprocessing docs for more details.
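For example, here is a minimal sketch (not from the original post) of a my_slow_stuff.py whose import has no side effects; instead of hiding the class behind the __name__ guard, it defers the slow work into __init__, which satisfies the same "safe import" requirement:

    # my_slow_stuff.py
    import time

    class MySlowStuff(object):
        def __init__(self):
            # The slow work now runs only when an instance is created,
            # not each time a spawned child process re-imports the module.
            print('doing slow stuff')
            time.sleep(0)
            print('done slow stuff')

    if __name__ == '__main__':
        MySlowStuff()

With this version, importing my_slow_stuff prints nothing, so the pool's child processes start silently; the messages appear only when the caller actually creates a MySlowStuff instance.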