I wrote this code to do some data analysis work, and it has been working well for several months. Recently, though, the size of my source data increased significantly, and now the code hangs without any error at roughly the same point in execution (but not always exactly the same point).
The code looks like this:
import concurrent.futures
import functools
from multiprocessing import cpu_count

def submission_loop(data, submission):
    # No loops in this function
    # Do some data analysis
    return result

def data_loop(arg1, arg2, data_row):
    # Check this marker against all the criteria
    results = []
    for data in data_row:
        # 'submissions' is defined elsewhere in my script
        for submission in submissions:
            results.append(submission_loop(data, submission))
            # Do something with result here
    return results

if __name__ == '__main__':
    # data_rows and args are built earlier in the script (omitted here)
    with concurrent.futures.ProcessPoolExecutor(max_workers=cpu_count()) as executor:
        chunksize = max(1, int(len(data_rows) / cpu_count()))
        results = executor.map(functools.partial(data_loop, *args), data_rows, chunksize=chunksize)
        results = list(results)
Note the three levels of loops.
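For anyone who wants to poke at the structure, here is a stripped-down, self-contained approximation of the above. The dummy data, the make_dummy_rows helper, and the fixed argument values are mine purely to make it runnable; the real analysis is far heavier.

import concurrent.futures
import functools
from multiprocessing import cpu_count

def submission_loop(data, submission):
    # Stand-in for the real analysis: just combine the two values
    return data * submission

def data_loop(arg1, arg2, data_row):
    submissions = range(10)  # placeholder for the real submissions list
    results = []
    for data in data_row:
        for submission in submissions:
            results.append(submission_loop(data, submission))
    return results

def make_dummy_rows(n_rows, row_len):
    # Fabricated data, only so the example runs end to end
    return [list(range(row_len)) for _ in range(n_rows)]

if __name__ == '__main__':
    data_rows = make_dummy_rows(1000, 50)
    args = (None, None)  # placeholders for arg1, arg2
    with concurrent.futures.ProcessPoolExecutor(max_workers=cpu_count()) as executor:
        chunksize = max(1, int(len(data_rows) / cpu_count()))
        results = list(executor.map(functools.partial(data_loop, *args),
                                    data_rows, chunksize=chunksize))
    print(len(results))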
Now, I've tested this on 3 machines:

1. python:3.8.5 running on an Ubuntu 20 host.
2. python:3.8.5 running on a Windows 10 host using Docker Desktop.
3. python 3.8.5.

On both 1 and 2, the above-mentioned issue is seen consistently. On 3, the task completes successfully.
I changed this to use ThreadPoolExecutor instead, and the issue persists, which makes me think the number of cores is irrelevant here. If I remove concurrent.futures entirely and use a plain serial loop, it works perfectly.
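Roughly, the two variants I tried look like this (same placeholder names as in the code above; a sketch, not the exact diff):

# Variant 1: threads instead of processes -- still hangs for me
# (chunksize only affects ProcessPoolExecutor, so I dropped it here)
with concurrent.futures.ThreadPoolExecutor(max_workers=cpu_count()) as executor:
    results = list(executor.map(functools.partial(data_loop, *args), data_rows))

# Variant 2: plain serial loop -- completes without issues
results = [data_loop(*args, data_row) for data_row in data_rows]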
Is this a bug with concurrent.futures?
Further to my previous comment: I noticed that pebble was also causing similar issues. I switched to futureproof and the issue has been resolved.