Tags: python, multithreading, multiprocessing

Conceptually: is multiprocessing on i/o bound problems overkill?


I understand that multithreading is optimal for i/o bound problems and multiprocessing is optimal for cpu bound problems. I also understand that multiprocessing preceded multithreading, and I want to understand the motivation for why multithreading is necessary.

My current perspective

You can apply multiple processors to an i/o bound problem, but this would be inefficient, since more processing power is not the bottleneck. However, by running the processes in parallel, you will benefit from the i/o operations running in parallel and the program will have performance gains. The motivation for multithreading is to make more efficient use of the processor since the utilization is low and infrequent in i/o bound problems. Multithreading therefore accomplishes the same objective as multiprocessing but is more efficient, since it removes the need to utilize multiple cores.

Please correct any misunderstandings you can see and thank you so much.


Solution

  • Threads in Python are somewhat crippled in a way that prevents any single Python process from making effective use of more than one CPU, no matter how many CPUs your system has. (Google for "Python" and "GIL" to learn more.)

    You can't use multithreading in Python as a way to achieve parallel processing.

    I want to understand the motivation for why multithreading is necessary.

    Technically, it is not necessary. You can't use multithreading as a way to achieve parallelism, and any other program behavior that you can achieve by using threads can also be achieved without threads...

    ...But the code might look uglier.

    Threads are useful when several different "activities" are happening in the same program, and those activities are driven by events that don't synchronize with each other. Each thread in a multi-threaded program can operate on its own schedule, and its code looks a lot like the code of a single-threaded program that only performs one "activity." In other words, the code of each thread looks a lot like the kind of code we all learned to write when we were just beginners.
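    A minimal sketch of that idea, where `time.sleep` stands in for a blocking i/o wait and the activity names are made up for illustration:

    ```python
    import threading
    import time
    import queue

    # Two independent "activities", each written as plain sequential
    # code, each running on its own schedule in its own thread.
    done = queue.Queue()

    def activity(name, delay):
        time.sleep(delay)          # blocking "I/O"; CPython releases the GIL here
        done.put(name)             # report completion back to the main thread

    threads = [
        threading.Thread(target=activity, args=("fetch-page", 0.2)),
        threading.Thread(target=activity, args=("poll-socket", 0.1)),
    ]

    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.monotonic() - start

    finished = {done.get(), done.get()}
    # The two waits overlap, so the total is close to the longest
    # single wait, not the sum of both.
    print(finished, round(elapsed, 2))
    ```

    Each `activity` body reads exactly like beginner-style single-threaded code; the concurrency lives entirely in the thread machinery around it.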

    One alternative to using threads is "event-driven" programming. In an event-driven program, there is one master loop that continually checks for all of the different events that can come from different sources (e.g., different network sockets having data that are available to read). The event loop has to know which activity each different event belongs to, and then it has to call an appropriate "handler" to drive that activity. The state of each activity must be explicitly maintained in data structures that the handlers know about, and it all looks* more complicated than multi-threaded code. (In multi-threaded code, most of an activity's state is implicit in the thread's "context," that is to say, in the call stack and all the local variables.)
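    Here is a tiny event-driven sketch along those lines, using the standard-library `selectors` module; a `socketpair` stands in for each network client, and the tags are invented for illustration:

    ```python
    import selectors
    import socket

    sel = selectors.DefaultSelector()

    def make_activity(tag):
        server_side, client_side = socket.socketpair()
        server_side.setblocking(False)
        # The registration's data field records which activity this
        # socket belongs to -- state the event loop must track explicitly.
        sel.register(server_side, selectors.EVENT_READ, data=tag)
        return client_side

    client1 = make_activity("activity-1")
    client2 = make_activity("activity-2")

    # Events arrive from the two sources on their own schedules.
    client1.send(b"hello")
    client2.send(b"world")

    # The master loop: wait for events, look up the owning activity,
    # and dispatch to its "handler" (here, just a recv).
    results = {}
    while len(results) < 2:
        for key, _events in sel.select(timeout=1):
            results[key.data] = key.fileobj.recv(1024)

    sel.close()
    print(results)
    ```

    Even in this toy version, the bookkeeping (registrations, tags, the dispatch loop) is visible in a way it never is in the threaded equivalent, which is the point being made above.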

    There are other models besides multithreading and event-driven code for these kinds of multi-activity programs. They have names like "async" and "actor model," but I don't have experience with those, and I can't say much about them.

    Multithreading therefore accomplishes the same objective as multiprocessing but is more efficient, since it removes the need to utilize multiple cores.

    That doesn't sound quite right. The only reason you ever need to use more than one CPU core is when your application requires more CPU cycles than a single core can deliver. In most programming languages, you can achieve that by spawning multiple threads or multiple processes, but in Python in particular, because of that "GIL" (you Googled it, right?), threads are not a viable option for exploiting multiple cores.
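    To sketch the multiple-process route (function name and worker count are illustrative): each worker is a separate interpreter with its own GIL, so CPU-bound work like this can genuinely run on several cores at once.

    ```python
    from multiprocessing import Pool

    def burn(n):
        # A CPU-bound task: a pure-Python sum of squares, no I/O.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        # The __main__ guard matters: on "spawn"-based platforms the
        # workers re-import this module, and the guard keeps them from
        # re-running the Pool setup.
        with Pool(processes=4) as pool:
            results = pool.map(burn, [10_000] * 4)
        print(results)
    ```

    The same four calls written with `threading` instead of `Pool` would still take one core's worth of time in CPython, because the GIL lets only one thread execute Python bytecode at a time.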


    * Beware. Multi-threaded code is easier to read than event-driven code, but it's not necessarily easier to write without making subtle mistakes. Multi-threading is tricky!