Summary:
I'm trying to use the multiprocess and multiprocessing libraries to parallelise work.
Errors:
My approach works for a small amount of work but fails with the following on larger tasks:
OSError: [Errno 24] Too many open files
Solutions tried:
Running on a macOS Catalina system, ulimit -n gives 1024 within PyCharm.
Is there a way to avoid having to change ulimit? I want to avoid this, as the code should ideally work out of the box on various systems.
I've seen related questions (like this thread) whose comments recommend using .join and gc.collect; other threads recommend closing any opened files, but I do not access files in my code.
import gc
import time
import numpy as np
from math import pi
from multiprocess import Process, Manager
from multiprocessing import Semaphore, cpu_count
def do_work(element, shared_array, sema):
    shared_array.append(pi*element)
    gc.collect()
    sema.release()
# example_ar = np.arange(1, 1000) # works
example_ar = np.arange(1, 10000) # fails
# Parallel code
start = time.time()
# Instantiate a manager object and a share a datastructure
manager = Manager()
shared_ar = manager.list()
# Create semaphores linked to physical cores on a system (1/2 of reported cpu_count)
sema = Semaphore(cpu_count()//2)
job = []
# Loop over every element and start a job
for e in example_ar:
    sema.acquire()
    p = Process(target=do_work, args=(e, shared_ar, sema))
    job.append(p)
    p.start()
_ = [p.join() for p in job]
end_par = time.time()
# Serial code equivalent
single_ar = []
for e in example_ar:
    single_ar.append(pi*e)
end_single = time.time()
print(f'Parallel work took {end_par-start} seconds, result={sum(list(shared_ar))}')
print(f'Serial work took {end_single-end_par} seconds, result={sum(single_ar)}')
The way to avoid changing the ulimit is to make sure the number of processes you create never grows beyond the 1024 descriptor limit. That's why 1000 elements work and 10000 fail.
Here's an example of managing the processes with a Pool which will ensure you don't go above the ceiling of your ulimit value:
from multiprocessing import Pool
def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
https://docs.python.org/3/library/multiprocessing.html#introduction
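Applied to the workload in the question, a Pool version could look like the sketch below. This is not the poster's code: do_work is changed to return its result instead of appending to a Manager list, the chunksize value is an illustrative choice, and the pool size simply mirrors the question's cpu_count()//2 heuristic.

import time
import numpy as np
from math import pi
from multiprocessing import Pool, cpu_count

def do_work(element):
    return pi*element

if __name__ == '__main__':
    example_ar = np.arange(1, 10000)
    start = time.time()
    # A Pool reuses a fixed set of worker processes, so the number of open
    # file descriptors stays bounded no matter how many elements there are.
    with Pool(cpu_count()//2) as pool:
        result = pool.map(do_work, example_ar, chunksize=100)
    end = time.time()
    print(f'Pool work took {end-start} seconds, result={sum(result)}')

Because the results come back through pool.map's return value, the Manager list and the semaphore from the original code are no longer needed.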
"other threads recommend closing any opened files but I do not access files in my code"
You don't have files open, but each process you start opens file descriptors (the pipes multiprocessing uses for communication), and those descriptors are what the OS is counting here.
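If you want to keep the process-per-element structure, another option is to join finished processes as you go instead of holding every Process object in the job list until the end; joining a process lets the parent close its pipe descriptors. Below is a minimal sketch of that idea, using only the standard multiprocessing module and a plain range instead of np.arange; the reap helper is hypothetical, not from the original post.

from math import pi
from multiprocessing import Process, Manager, Semaphore, cpu_count

def do_work(element, shared_array, sema):
    shared_array.append(pi*element)
    sema.release()

def reap(jobs):
    # Hypothetical helper: join processes that have already exited so the
    # parent releases their file descriptors; keep only the live ones.
    alive = []
    for p in jobs:
        if p.is_alive():
            alive.append(p)
        else:
            p.join()
    return alive

if __name__ == '__main__':
    manager = Manager()
    shared_ar = manager.list()
    sema = Semaphore(cpu_count()//2)
    jobs = []
    for e in range(1, 10000):
        sema.acquire()
        jobs = reap(jobs)  # free descriptors of workers that have finished
        p = Process(target=do_work, args=(e, shared_ar, sema))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()
    print(sum(shared_ar))

With this shape, the number of descriptors held by the parent tracks the number of live workers (bounded by the semaphore) rather than the total number of elements.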