Differences between Python parallel threads & processes


Based on my current understanding, a process is a collection of instructions along with all the resources it uses while it is running. This includes the code, input/output, resources, memory, file handles, and more. In other words, it encompasses everything required for the execution of a program.

# this script, while running as a whole, is considered a process

print('hello world')

with open('something.txt', 'a') as file_handle:
    for i in range(500):
        file_handle.write('blablabla')
print('job done!')

To utilize my computer's processing power more efficiently, I can spawn additional processes or threads. Which one should I choose? How do they compare to the simple Python script process analogy? Is spawning another process similar to recalling the entire script with a different filename?

# changed filename (is this "another process?")

print('hello world')

with open('something_else.txt', 'a') as file_handle:
    for i in range(500):
        file_handle.write('blablabla')
print('job done!') 

I also get the vague idea that a single process can contain multiple threads. Would that just be the equivalent of running a bunch of extra "conceptual" for loops, then?

# would this be a "thread", i.e. a barebones "subset" of an entire program?

with open('something.txt', 'a') as file_handle:
    for i in range(500):
        file_handle.write('blablabla')

What are the key differences between processes and threads? Online sources suggest that processes are more autonomous and resource-intensive, while threads are more lightweight and able to share memory with one another. But what does this mean in practice? Why can't processes also share memory? If threads are able to share memory, why can't I access variables from different threads that are spawned from the same script (e.g. from thread_a import var_data)?

Lastly, what computes what exactly? Does a CPU compute threads or processes, or is it a broader term that includes multiple cores, etc? Do cores compute processes or threads?


Summary:

  1. Using a simple python script as an example for a process, what would the equivalent of spawning another process/thread be? (e.g. duplicate script/subset of a script/some section of code only)

  2. How are processes fundamentally different from threads? What is an example of processes being able to do something that threads cannot?

  3. Why is memory/data often described as "harder to share" in processes than in threads? And how do threads share data anyway?

  4. Do CPUs compute threads or processes? Do cores compute threads or processes?

  5. Can you provide general guidelines and examples for when to use certain things? Is there a rule of thumb for threads vs processes in python?


Solution

  • To start answering this, you need to understand what Python's GIL (Global Interpreter Lock) is. In the standard interpreter (CPython), any part of the code can access any object in the process's memory. To avoid problems such as two threads modifying the same object at the same moment, there is a lock that prevents two threads from executing Python bytecode at the same time. This is why, inside a single process, Python effectively executes tasks one after the other.

    In modern programming, there is a push to make better use of multi-core processors by parallelizing code to improve performance. Given the GIL, there are two options:

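    • threading is a module that lets you spawn multiple tasks "at the same time" in different threads. The catch is that they do not really run at the same time: the interpreter runs a small slice of one thread, then switches to another. Two threads NEVER execute Python code at the same instant, and they all live in the same process, so you can still share memory as usual, which is why threads are simple to use. (Compound operations such as counter += 1 can still be interrupted halfway by a switch, so a Lock is still needed when several threads update the same data.)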

    • multiprocessing, on the other hand, lets you spawn real processes, which run simultaneously. The price is that processes do not share memory in the usual way: each process has its own memory space, so a change made in one is not visible in another. There is no problem in having multiple processes with multiple threads in each. You are not completely on your own, though. There are a few ways to communicate safely between processes, for instance a Queue or Pipe to pass data, or a Lock to coordinate access to a shared resource. The multiprocessing documentation covers these in more detail.
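As a rough illustration of the memory difference between the two, here is a minimal sketch. The names shared, record, record_via_queue, run_thread and run_process are made up for this example. A thread and a process both run the same function that appends to a module-level list; only the thread's change is visible afterwards, because the child process works on its own copy of the list. A multiprocessing.Queue is then used to send a value back explicitly.

```python
import multiprocessing as mp
import threading

# Hypothetical module-level state: it lives in the memory of whichever
# process has loaded this module.
shared = []


def record(value):
    """Append to the module-level list of the process running this call."""
    shared.append(value)


def record_via_queue(value, queue):
    """Send the value to the parent explicitly instead of mutating a list."""
    queue.put(value)


def run_thread(value):
    """Run record() in a thread of the current process and wait for it."""
    t = threading.Thread(target=record, args=(value,))
    t.start()
    t.join()


def run_process(value):
    """Run record() in a separate process and wait for it."""
    p = mp.Process(target=record, args=(value,))
    p.start()
    p.join()


if __name__ == "__main__":
    run_thread("from thread")
    run_process("from process")
    # Only the thread's value shows up: the child process appended to
    # its own copy of the list, which disappeared when it exited.
    print(shared)

    queue = mp.Queue()
    p = mp.Process(target=record_via_queue, args=("via queue", queue))
    p.start()
    print(queue.get())  # the value arrives in the parent as a pickled copy
    p.join()
```

The `if __name__ == "__main__":` guard matters here: on platforms where child processes are started by re-importing the script, it keeps the child from spawning processes of its own.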

    To sum up, threads and processes both let you separate some tasks from others, giving you a way to improve on a basic sequential program. In some languages there is not much difference in the way they work, but in Python the main things to remember are:

    • Threads: keep shared memory, but are not really parallel. They are useful if your code has waiting times (network, disk, user input), so that other work can be done in between. If you are already using 100% CPU, threads will slow your code down, because execution switches often between tasks and the switching adds overhead.

    • Processes: a bit more difficult to implement, because you have to worry about memory, which you normally don't in Python. The major upside is that you can dramatically improve performance if your code can be parallelized.
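One way to picture this rule of thumb is the sketch below, which uses the standard concurrent.futures module. The function names (wait_for_io, count_primes, run_io_in_threads, run_cpu_in_processes) and the workloads are invented for illustration: time.sleep stands in for a network or disk wait, and a naive prime count stands in for CPU-heavy work. Actual timings depend on the machine, so none are claimed here.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def wait_for_io(seconds):
    """Stand-in for a network or disk wait; a sleeping thread releases the GIL."""
    time.sleep(seconds)
    return seconds


def count_primes(limit):
    """Pure-Python CPU-bound work: count the primes below limit by trial division."""
    return sum(
        all(n % d for d in range(2, int(n ** 0.5) + 1))
        for n in range(2, limit)
    )


def run_io_in_threads(delays):
    """Waiting tasks overlap in threads, so the waits happen side by side."""
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        return list(pool.map(wait_for_io, delays))


def run_cpu_in_processes(limits):
    """CPU-heavy tasks go to separate processes, each with its own GIL."""
    with ProcessPoolExecutor() as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    start = time.perf_counter()
    run_io_in_threads([0.5] * 4)
    print(f"four 0.5 s waits in threads: {time.perf_counter() - start:.2f} s")

    start = time.perf_counter()
    run_cpu_in_processes([50_000] * 4)
    print(f"four prime counts in processes: {time.perf_counter() - start:.2f} s")
```

Swapping the two executors is a quick way to see the trade-off on your own machine: the CPU-bound work in threads should bring little or no speed-up, while the sleeping tasks in processes still overlap but pay the cost of starting the processes.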