
Python multiprocessing 2x slower than serial regardless of chunksize?


I am trying to modify the code found here to use multiprocessing: https://github.com/Sensory-Information-Processing-Lab/infotuple/blob/master/body_metrics.py

In the primal_body_selector function, I want to run lines 146-150 in parallel:

    for i in range(len(tuples)):
        a = tuples[i][0]
        B = tuples[i][1:]

        infogains[i] = mutual_information(M, a, B, M.shape[0]/10, dist_std, mu)

I believe this could lead to significant performance gains because the mutual_information function (code here) is mostly matrix math, so multiprocessing should really help.

However, when I try a simple pool = ThreadPool(processes=8) at the top of the file (primal_body_selector is called from a separate main() method, so pool is initialized on import) and then run the code below in place of the loop shown above:

    def infogains_task_function(i, infogains, M, tuples, dist_std, mu):
        a = tuples[i][0]
        B = tuples[i][1:]

        infogains[i] = mutual_information(M, a, B, M.shape[0], dist_std, mu)

...

    # inside primal_body_selector
    pool.starmap(infogains_task_function,
                 [(i, infogains, M, tuples, dist_std, mu) for i in range(len(tuples))],
                 chunksize=80)

This version is twice as slow as the serial loop (4 seconds vs. 2 seconds, as measured by time.time()). Why is that? Regardless of which chunksize I pick (I tried 1, 20, 40, and 80), it's twice as slow.

I originally thought that serializing M and tuples could be the reason, but M is a 32x32 matrix and tuples is a list of 179 tuples of length 3 each, so it's really not that much data, right?
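
For what it's worth, the amount of data that would have to be serialized per call can be estimated with a quick sketch like the one below (placeholder data with the shapes described above, not the real infotuple data):

    import pickle
    import random
    import numpy as np

    # Placeholder data with the shapes described above.
    M = np.random.rand(32, 32)   # 32x32 matrix of floats
    tuples = [tuple(random.randrange(32) for _ in range(3)) for _ in range(179)]

    payload = len(pickle.dumps(M)) + len(pickle.dumps(tuples))
    print(f"approximate serialized size: {payload} bytes")   # on the order of 10 KB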

Any help would be greatly appreciated.


Solution

  • Neither multiprocessing nor multithreading is a magic silver bullet... You are right that multiprocessing is a nice tool for heavy computations on multi-processor systems (or multi-core processors, which is functionally the same).

    The problem is that spreading work across a number of threads or processes adds complexity: you have to share or copy some memory, gather the results at some point, and synchronize everything. So for simple tasks, the overhead is higher than the gain (the toy benchmark at the end of this answer illustrates this).

    Admittedly, if you carefully split your tasks by hand you may reduce that overhead. But when you use a generic tool (even a nicely crafted one like the Python standard library), you should be aware that its creators had to handle many use cases and include a number of checks in their code... again with added complexity. And the manual way dramatically increases the development (and testing) cost...

    What you should remember from this: use simple tools for simple tasks, and only go with multi-x things when they are really required. Some real use cases:

    • heavily loaded operational servers: the extra development cost is balanced by the ability to support heavy loads without crashing
    • really heavy computations (meteorological or oceanographic forecast models): when the length of a single run exceeds several hours, something has to be done ;-)
    • and most important: multi-x things are optimization tools. Optimization always has a cost, so you must think carefully about what really requires it and what can be done, and use benchmarks to make sure that the added complexity was worth it - nothing is ever self-evident here...

    BTW, for simple computation tasks like matrix operations, numpy/scipy are probably far better suited than raw Python processing (see the vectorization sketch below)...
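
    To make the overhead point above concrete, here is a minimal, self-contained benchmark in the same spirit as the question. It is only a sketch: tiny_task is a deliberately cheap stand-in for mutual_information (not the real function), and exact numbers will vary from machine to machine.

        import time
        import numpy as np
        from multiprocessing.pool import ThreadPool

        def tiny_task(i, M):
            # Cheap stand-in for mutual_information: a little matrix math per call.
            return float(np.sum(M @ M))

        if __name__ == "__main__":
            M = np.random.rand(32, 32)
            args = [(i, M) for i in range(179)]

            # Serial loop: no scheduling or synchronization overhead.
            start = time.perf_counter()
            serial = [tiny_task(*a) for a in args]
            print(f"serial     : {time.perf_counter() - start:.4f} s")

            # Thread pool: the same work, plus the cost of dispatching 179 tiny
            # jobs to 8 workers and collecting the results. When each task is
            # this cheap, that overhead typically eats any gain, whatever the
            # chunksize.
            with ThreadPool(processes=8) as pool:
                start = time.perf_counter()
                pooled = pool.starmap(tiny_task, args, chunksize=20)
                print(f"ThreadPool : {time.perf_counter() - start:.4f} s")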
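
    As an illustration of that last point, here is a generic vectorization sketch (again, not the infotuple code): replacing a Python-level loop of many tiny NumPy calls with a single vectorized call usually gives a far bigger win at this problem size than any pool would.

        import time
        import numpy as np

        M = np.random.rand(32, 32)
        rows = np.random.randint(0, 32, size=179)

        # Python-level loop: one small NumPy call per index.
        start = time.perf_counter()
        loop_result = np.array([np.dot(M[r], M[r]) for r in rows])
        print(f"loop      : {time.perf_counter() - start:.6f} s")

        # Vectorized: a single NumPy call over all indices at once.
        start = time.perf_counter()
        vec_result = np.einsum("ij,ij->i", M[rows], M[rows])
        print(f"vectorized: {time.perf_counter() - start:.6f} s")

        # Both compute the squared norm of each selected row.
        assert np.allclose(loop_result, vec_result)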