Search code examples
pythonpython-3.xmultithreadinghpc

concurrent.futures.ThreadPoolExecutor scaling poorly on HPC


I'm new to using HPC and had some questions regarding parallelization of code. I have some python code which I've successfully parallelized using multi-threading which works great on my personal machine and a server. However, I just got access to HPC resources at my university. The general section of code which I use looks like this:

# Iterate over weathers
    print(f'Number of CPU: {os.cpu_count()}')
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(run_weather, i) for i in range(len(weather_list))]

        for f in concurrent.futures.as_completed(futures):
            results.append(f.result())

On my personal machine, when running two weathers (the item I've parallelized) I get the following results:

Number of CPU: 8
done in 29.645377159118652 seconds

On the HPC when running on 1 node with 32 cores, I get the following results:

Number of CPU: 32
done in 86.95256996154785 seconds

So it is running almost two times as worse - like it's running serially with the same processing overhead. I also attempted switching the code to ProcessPoolExecutor, and it was much slower than ThreadPoolExecutor. I'm assuming this must be something with data transfer (I've seen that multiprocessing on multiple nodes attempts to pass the entire program, packages and all), but as I said I'm very new to HPC and the wiki provided by my university leaves much to be desired.


Solution

  • Adding threads will only slow down the execution of CPU-bound programs (as so much time is wasted acquiring and releasing locks). Unlike some other programming languages, in Python, threads are bound by a global interpreter lock (GIL), so you may not see the same behavior you may have seen in other languages.

    As a rule of thumb for Python: multiple processes are good for CPU-bound work and spreading workloads across multiple cores. Threads are great for things like concurrent IO. Threads will not speed up any kind of computational work.

    Additionally, running n number threads equal to the number of CPUs probably does not make sense for your program. Particularly because threading won't spread work across CPU cores, anyhow. More threads adds more overhead in acquiring and releasing the GIL, hence why your execution times are higher the more threads you use.

    Next steps:

    1. Determine if threading is speeding up the execution of your program at all.

    Try running your code single-threaded, then try again with a small number of threads. Does it speed up at all?

    Chances are, you will find threading is not helping you at all here.

    1. Explore multiprocessing instead of threading. I.E. you should use the ProcessPoolExecutor instead of threadpool.

    Keep in mind, processes have their own caveats just like threads.

    Here is a useful reference to learn more.