
What is some example code for demonstrating multicore speedup in Python on Windows?


I'm using Python 3 on Windows and trying to construct a toy example, demonstrating how using multiple CPU cores can speed up computation. The toy example is rendering of the Mandelbrot fractal.

So far:

  • I have avoided threading, since the Global Interpreter Lock keeps threads from running Python code on more than one core at a time
  • I'm skipping example code that won't work on Windows because Windows lacks the forking capability of Linux
  • I'm trying the "multiprocessing" package: I declare p=Pool(8) (8 is my number of cores) and use p.starmap(..) to delegate work. This is supposed to produce multiple "subprocesses", which Windows will automatically schedule on different CPUs

However, I'm unable to demonstrate any speedup, whether due to overhead or no actual multiprocessing. Pointers to toy examples with demonstrable speedup would therefore be very helpful :-)

Edit: Thank you! This pushed me in the right direction and I've now got a working example that demonstrates a doubling of speed on a CPU with 4 cores.
A copy of my code with "lecture notes" here: https://pastebin.com/c9HZ2vAV

I settled on using Pool() but will later try out the "Process" alternative that @16num pointed out. Below is a code example for Pool():

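    from functools import partial
    from multiprocessing import Pool, cpu_count

    p = Pool(cpu_count())

    #starmap unpacks each (j,k) tuple into positional arguments; "partial" pre-binds the fixed dataarray argument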
    partial_calculatePixel = partial(calculatePixel, dataarray=data) 
    koord = []
    for j in range(height):
        for k in range(width):
            koord.append((j,k))

    #Runs the calls to calculatePixel in a pool. "hmm" collects the output
    hmm = p.starmap(partial_calculatePixel,koord)

Solution

  • It's very simple to demonstrate a multiprocessing speedup:

    import multiprocessing
    import time
    
    # high-resolution clock; time.clock was removed in Python 3.8
    get_timer = time.perf_counter
    
    def cube_function(num):
        time.sleep(0.01)  # let's simulate it takes ~10ms for the CPU core to cube the number
        return num**3
    
    if __name__ == "__main__":  # multiprocessing guard
        # we'll test multiprocessing with pools from one to the number of CPU cores on the system
        # it won't show significant improvements after that and it will soon start going
        # downhill due to the underlying OS thread context switches
        for workers in range(1, multiprocessing.cpu_count() + 1):
            pool = multiprocessing.Pool(processes=workers)
            # let's 'warm up' our pool so it doesn't affect our measurements
            pool.map(cube_function, range(multiprocessing.cpu_count()))
            # now to the business, we'll have 10000 numbers to cube via our expensive function
            print("Cubing 10000 numbers over {} processes:".format(workers))
            timer = get_timer()  # time measuring starts now
            results = pool.map(cube_function, range(10000))  # map our range to the cube_function
            timer = get_timer() - timer  # get our delta time as soon as it finishes
            print("\tTotal: {:.2f} seconds".format(timer))
            print("\tAvg. per process: {:.2f} seconds".format(timer / workers))
            pool.close()  # let's clear out our pool for the next run
            pool.join()  # wait for the worker processes to exit
            time.sleep(1)  # waiting for a second to make sure everything is cleaned up
    

    Of course, we're only simulating a 10 ms calculation per number here; you can replace cube_function with anything CPU-taxing for a real-world demonstration. The results are as expected:

    Cubing 10000 numbers over 1 processes:
            Total: 100.01 seconds
            Avg. per process: 100.01 seconds
    Cubing 10000 numbers over 2 processes:
            Total: 50.02 seconds
            Avg. per process: 25.01 seconds
    Cubing 10000 numbers over 3 processes:
            Total: 33.36 seconds
            Avg. per process: 11.12 seconds
    Cubing 10000 numbers over 4 processes:
            Total: 25.00 seconds
            Avg. per process: 6.25 seconds
    Cubing 10000 numbers over 5 processes:
            Total: 20.00 seconds
            Avg. per process: 4.00 seconds
    Cubing 10000 numbers over 6 processes:
            Total: 16.68 seconds
            Avg. per process: 2.78 seconds
    Cubing 10000 numbers over 7 processes:
            Total: 14.32 seconds
            Avg. per process: 2.05 seconds
    Cubing 10000 numbers over 8 processes:
            Total: 12.52 seconds
            Avg. per process: 1.57 seconds
    

    Now, why isn't the scaling perfectly linear? First, it takes some time to map/distribute the data to the sub-processes and to get it back; there is some cost to context switching; other tasks use my CPUs from time to time; and time.sleep() is not exactly precise (nor could it be on a non-RT OS). But the results are roughly in the ballpark expected for parallel processing.
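As a sketch of the "anything CPU-taxing" variant mentioned above, the following swaps the sleep for real computation. The names count_primes and timed_map, the task size and the task count are assumptions chosen for illustration, not part of the original answer; the timings you get will depend on your machine.

```python
import multiprocessing
import time


def count_primes(limit):
    """CPU-bound stand-in for cube_function: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        # n is prime if no d in [2, sqrt(n)] divides it
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed_map(workers, tasks):
    """Map count_primes over tasks with a pool of the given size; return (results, seconds)."""
    with multiprocessing.Pool(processes=workers) as pool:
        pool.map(count_primes, range(workers))  # warm up the workers before timing
        start = time.perf_counter()
        results = pool.map(count_primes, tasks)
        elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":  # multiprocessing guard, required on Windows
    tasks = [20000] * 32  # assumed workload: 32 identical tasks
    for workers in (1, multiprocessing.cpu_count()):
        _, seconds = timed_map(workers, tasks)
        print("{} worker(s): {:.2f} seconds".format(workers, seconds))
```

Unlike time.sleep(), this workload keeps a core busy for the whole task, so the speedup should level off around the number of physical cores instead of continuing to improve with more processes.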