python · parallel-processing · multiprocessing · montecarlo

Multiprocessing not using whole CPU


I'm testing Python's multiprocessing module. I'm trying to compute pi with a Monte Carlo technique, using all 12 threads of my Ryzen 5 5600.

The problem is that my CPU is not fully used; it sits at about 47%. My code is below. Changing the value of n_cpu barely changes the core usage, while increasing N by an order of magnitude can push the load up to 77%... but I thought N shouldn't affect the number of processes. Please help me understand how to correctly parallelize my code, thanks.

import random
import math
import numpy as np
import multiprocessing
from multiprocessing import Pool

def sample(n):
    n_inside_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n_inside_circle

N_test=1000
N=12*10**4
n_cpu = 12
pi=0

for j in range(N_test):
    part_count=[int(N/n_cpu)] * n_cpu
    pool = Pool(processes=n_cpu)
    results = pool.map(sample, part_count)
    pool.close()
    pi += sum(results)/(N*1.0)*4

print(pi/N_test)

Solution

  • The low CPU usage is because you create a brand-new process pool on every iteration of the loop and send each small chunk of work to it, instead of sending all of the work to a single pool. Starting 12 worker processes 1000 times is expensive, and the CPU sits partly idle while each new pool spins up.
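
    As a rough illustration of that start-up cost (a minimal sketch; the numbers depend on your OS and start method), you can time pool creation with no work at all:

    import time
    from multiprocessing import Pool

    # on Windows/macOS, run this under an `if __name__ == "__main__":` guard
    start = time.perf_counter()
    for _ in range(20):
        with Pool(processes=12):
            pass                      # create and tear down a 12-worker pool, doing no work
    end = time.perf_counter()
    print(f"{(end - start) / 20:.3f} s per pool start-up")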

    Simply creating the pool once, outside the loop:

    pool = Pool(processes=n_cpu)
    for j in range(N_test):
        part_count=[int(N/n_cpu)] * n_cpu
        results = pool.map(sample, part_count)
        pi += sum(results)/(N*1.0)*4
    pool.close()
    

    should give a noticeable speed-up, because the 12 worker processes are started once and then reused for all 1000 iterations.
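
    Equivalently, the pool can be used as a context manager so the workers are always cleaned up (a small sketch of the same fix, reusing the names from the question):

    # one pool, reused for every iteration; terminated automatically on exit
    with Pool(processes=n_cpu) as pool:
        for j in range(N_test):
            part_count = [int(N / n_cpu)] * n_cpu
            results = pool.map(sample, part_count)
            pi += sum(results) / N * 4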

    To optimize this further

    We can change the way the jobs are split up so that each call to the worker function handles more samples.

    We can use NumPy's vectorized random functions, which run much faster than calling random.random() in a Python loop.
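
    A sketch of just the first two changes (bigger chunks plus vectorized sampling, still with a process pool; the sample counts below are illustrative):

    import numpy as np
    from multiprocessing import Pool

    def sample(n):
        # draw n points at once instead of looping one point at a time
        x = np.random.random(n)
        y = np.random.random(n)
        return int(np.sum(x**2 + y**2 < 1.0))

    if __name__ == "__main__":
        total_samples = 120_000 * 1000      # same N * N_test as in the question
        chunk = 10**6                       # 120 jobs of 1e6 samples instead of 12000 jobs of 1e4
        n_cpu = 12

        with Pool(processes=n_cpu) as pool:
            results = pool.map(sample, [chunk] * (total_samples // chunk))
        print(4 * sum(results) / total_samples)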

    Finally, for the last bit of speed, we can use Numba with a thread pool to reduce the overhead even more.

    import time
    import numpy as np
    from multiprocessing.pool import ThreadPool
    from numba import jit
    
    # nogil=True releases the GIL inside the compiled function, so ThreadPool threads can run it in parallel
    @jit(nogil=True, parallel=True, fastmath=True)
    def sample(n):
        x = np.random.random(n)
        y = np.random.random(n)
        inside_circle = np.square(x) + np.square(y) < 1.0
        return int(np.sum(inside_circle))
    
    total_samples = int(3e9)
    function_limit = int(1e7)
    n_cpu = 12
    pi=0
    
    assert total_samples%function_limit == 0
    
    start = time.perf_counter()
    with ThreadPool(n_cpu) as pool:
        part_count=[function_limit] * (total_samples//function_limit)
        results = pool.map(sample, part_count)
        pi = 4*sum(results)/(total_samples)
    end = time.perf_counter()
    print(pi)
    print(round(end-start,3), "seconds taken") 
    

    resulting in

    3.141589756
    6.982 seconds taken
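
    Because sample is compiled with nogil=True, the worker threads are not serialized by the GIL, so a ThreadPool achieves the same parallelism as a process pool while avoiding the cost of spawning processes and pickling arguments and results between them.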