I'm testing Python's multiprocessing module. I'm trying to compute pi with a Monte Carlo technique using the 12 threads of my Ryzen 5 5600.
The problem is that my CPU is not fully used; only about 47% is busy. Changing the value of n_cpu barely changes the core usage, while increasing N by an order of magnitude can push the load up to 77%, even though I thought N shouldn't affect the number of processes. My code is below. Please help me understand how to parallelize it correctly, thanks.
import random
import math
import numpy as np
import multiprocessing
from multiprocessing import Pool

def sample(n):
    n_inside_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n_inside_circle

N_test = 1000
N = 12 * 10**4
n_cpu = 12
pi = 0

for j in range(N_test):
    part_count = [int(N / n_cpu)] * n_cpu
    pool = Pool(processes=n_cpu)
    results = pool.map(sample, part_count)
    pool.close()
    pi += sum(results) / (N * 1.0) * 4

print(pi / N_test)
The low CPU usage comes from creating a brand-new process pool on every iteration of the loop and handing each one only a small chunk of work, so most of the time is spent starting and tearing down worker processes instead of sampling. Send all the work to a single process pool instead.
Simply moving the pool out of the loop,

pool = Pool(processes=n_cpu)
for j in range(N_test):
    part_count = [int(N / n_cpu)] * n_cpu
    results = pool.map(sample, part_count)
    pi += sum(results) / (N * 1.0) * 4
pool.close()
should already give a noticeable speedup.
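One practical note: on platforms where multiprocessing uses the spawn start method (Windows, and macOS since Python 3.8), the pool creation must sit under an if __name__ == "__main__": guard, otherwise each worker re-imports the script and tries to spawn pools of its own. A minimal sketch of the reused-pool version with that guard, using the same names as above:

import random
from multiprocessing import Pool

def sample(n):
    # Count points that land inside the unit quarter-circle.
    n_inside_circle = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n_inside_circle

if __name__ == "__main__":
    N_test = 1000
    N = 12 * 10**4
    n_cpu = 12
    pi = 0
    # Create the pool once and reuse it; the context manager closes it for us.
    with Pool(processes=n_cpu) as pool:
        part_count = [N // n_cpu] * n_cpu
        for _ in range(N_test):
            results = pool.map(sample, part_count)
            pi += 4 * sum(results) / N
    print(pi / N_test)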
To optimize this further:

- We can change the way the jobs are split up so that each call to sample handles a much larger batch, reducing per-task overhead (see the sketch after this list).
- We can draw the samples with NumPy's vectorized random functions, which run much faster than calling random.random() in a Python loop.
- Finally, for the last bit of speed, we can compile the sampling function with Numba and run it in a thread pool to reduce overhead even more.
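As an intermediate illustration of the first two points (before adding Numba), here is a sketch that hands each worker one large chunk and draws the samples with NumPy; the names total_samples and chunk and the sample count are only illustrative:

import numpy as np
from multiprocessing import Pool

def sample(n):
    # Draw n points at once with NumPy instead of looping in Python.
    x = np.random.random(n)
    y = np.random.random(n)
    return int(np.count_nonzero(x**2 + y**2 < 1.0))

if __name__ == "__main__":
    total_samples = int(1e8)          # illustrative size
    n_cpu = 12
    chunk = total_samples // n_cpu    # one big batch per worker
    with Pool(processes=n_cpu) as pool:
        results = pool.map(sample, [chunk] * n_cpu)
    print(4 * sum(results) / (chunk * n_cpu))

Combining all three points, with Numba and a thread pool: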
import time
import numpy as np
from multiprocessing.pool import ThreadPool
from numba import jit

@jit(nogil=True, parallel=True, fastmath=True)
def sample(n):
    x = np.random.random(n)
    y = np.random.random(n)
    inside_circle = np.square(x) + np.square(y) < 1.0
    return int(np.sum(inside_circle))

total_samples = int(3e9)
function_limit = int(1e7)
n_cpu = 12
pi = 0
assert total_samples % function_limit == 0

start = time.perf_counter()
with ThreadPool(n_cpu) as pool:
    part_count = [function_limit] * (total_samples // function_limit)
    results = pool.map(sample, part_count)
    pi = 4 * sum(results) / total_samples
end = time.perf_counter()

print(pi)
print(round(end - start, 3), "seconds taken")
resulting in
3.141589756
6.982 seconds taken
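As a sanity check on that result, each sample is a Bernoulli trial with success probability p = pi/4, so the one-sigma statistical error of the estimate is roughly 4 * sqrt(p * (1 - p) / total_samples). A small illustrative snippet (the 3.141589756 is just the run above):

import math

total_samples = int(3e9)
p = math.pi / 4
# Expected one-sigma error of the Monte Carlo estimate for this many samples.
std_err = 4 * math.sqrt(p * (1 - p) / total_samples)
print(f"expected 1-sigma error: {std_err:.1e}")            # about 3e-5 for 3e9 samples
print(f"observed error:         {abs(3.141589756 - math.pi):.1e}")  # about 3e-6

The observed deviation from math.pi is well within one standard error, so the parallel version is statistically consistent with the serial one.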