I'm testing Python's multiprocessing module. I'm trying to compute pi with a Monte Carlo technique using the 12 threads of my Ryzen 5 5600.
The problem is that my CPU is not fully used; only about 47% is busy. Changing the value of n_cpu barely changes the core usage, while increasing N by an order of magnitude can push the load up to 77%, even though I thought N shouldn't affect the number of processes. My code is below. Please help me understand how to parallelize it correctly, thanks.
import random
import math
import numpy as np
import multiprocessing
from multiprocessing import Pool

def sample(n):
    n_inside_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n_inside_circle

N_test = 1000
N = 12 * 10**4
n_cpu = 12
pi = 0

for j in range(N_test):
    part_count = [int(N / n_cpu)] * n_cpu
    pool = Pool(processes=n_cpu)
    results = pool.map(sample, part_count)
    pool.close()
    pi += sum(results) / (N * 1.0) * 4

print(pi / N_test)
The low CPU usage comes from creating a brand-new process pool on every iteration of the loop and handing each one only a small chunk of work, so most of the time is spent starting and tearing down worker processes instead of sampling. Send all the work to a single process pool instead.
Simply moving the pool out of the loop,

pool = Pool(processes=n_cpu)
for j in range(N_test):
    part_count = [int(N / n_cpu)] * n_cpu
    results = pool.map(sample, part_count)
    pi += sum(results) / (N * 1.0) * 4
pool.close()
should already give a noticeable speedup.
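One practical note: on platforms where multiprocessing uses the spawn start method (Windows, and macOS since Python 3.8), the pool creation must sit under an if __name__ == "__main__": guard, otherwise each worker re-imports the script and tries to spawn pools of its own. A minimal sketch of the reused-pool version with that guard, using the same names as above:

import random
from multiprocessing import Pool

def sample(n):
    # Count points that land inside the unit quarter-circle.
    n_inside_circle = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n_inside_circle

if __name__ == "__main__":
    N_test = 1000
    N = 12 * 10**4
    n_cpu = 12
    pi = 0
    # Create the pool once and reuse it; the context manager closes it for us.
    with Pool(processes=n_cpu) as pool:
        part_count = [N // n_cpu] * n_cpu
        for _ in range(N_test):
            results = pool.map(sample, part_count)
            pi += 4 * sum(results) / N
    print(pi / N_test)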
To optimize this further:

- We can change the way the jobs are split up so that each call to sample handles a much larger batch, reducing per-task overhead (see the sketch after this list).
- We can draw the samples with NumPy's vectorized random functions, which run much faster than calling random.random() in a Python loop.
- Finally, for the last bit of speed, we can compile the sampling function with Numba and run it in a thread pool to reduce overhead even more.
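As an intermediate illustration of the first two points (before adding Numba), here is a sketch that hands each worker one large chunk and draws the samples with NumPy; the names total_samples and chunk and the sample count are only illustrative:

import numpy as np
from multiprocessing import Pool

def sample(n):
    # Draw n points at once with NumPy instead of looping in Python.
    x = np.random.random(n)
    y = np.random.random(n)
    return int(np.count_nonzero(x**2 + y**2 < 1.0))

if __name__ == "__main__":
    total_samples = int(1e8)          # illustrative size
    n_cpu = 12
    chunk = total_samples // n_cpu    # one big batch per worker
    with Pool(processes=n_cpu) as pool:
        results = pool.map(sample, [chunk] * n_cpu)
    print(4 * sum(results) / (chunk * n_cpu))

Combining all three points, with Numba and a thread pool: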
import time
import numpy as np
from multiprocessing.pool import ThreadPool
from numba import jit

@jit(nogil=True, parallel=True, fastmath=True)
def sample(n):
    x = np.random.random(n)
    y = np.random.random(n)
    inside_circle = np.square(x) + np.square(y) < 1.0
    return int(np.sum(inside_circle))

total_samples = int(3e9)
function_limit = int(1e7)
n_cpu = 12
pi = 0
assert total_samples % function_limit == 0

start = time.perf_counter()
with ThreadPool(n_cpu) as pool:
    part_count = [function_limit] * (total_samples // function_limit)
    results = pool.map(sample, part_count)
    pi = 4 * sum(results) / total_samples
end = time.perf_counter()

print(pi)
print(round(end - start, 3), "seconds taken")
resulting in
3.141589756
6.982 seconds taken
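As a sanity check on that result, each sample is a Bernoulli trial with success probability p = pi/4, so the one-sigma statistical error of the estimate is roughly 4 * sqrt(p * (1 - p) / total_samples). A small illustrative snippet (the 3.141589756 is just the run above):

import math

total_samples = int(3e9)
p = math.pi / 4
# Expected one-sigma error of the Monte Carlo estimate for this many samples.
std_err = 4 * math.sqrt(p * (1 - p) / total_samples)
print(f"expected 1-sigma error: {std_err:.1e}")            # about 3e-5 for 3e9 samples
print(f"observed error:         {abs(3.141589756 - math.pi):.1e}")  # about 3e-6

The observed deviation from math.pi is well within one standard error, so the parallel version is statistically consistent with the serial one.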