Tags: python, arrays, numpy, cython, scientific-computing

How to use all cores to accelerate a simulation program based on numpy 3D arrays?


I'm using Python to build a physics simulation model. I have two numpy 3D arrays, arr_A and arr_B, each of size 50×50×15 (which may grow to 1000×1000×50 in the future), and I want to see how these two arrays evolve under a certain computation. I tried to accelerate my program with parallel computing on my 12-core machine, but the outcome was not good. I eventually realized that pure Python is very slow for scientific computing.

Do I have to rewrite my program in C? That would be quite a tough job. I've heard that Cython might be a solution; should I use it? I really need some advice on how to accelerate my program, since I'm a beginner in programming. I'm working on a Windows 10 x64 machine with 12 cores.

My computation is something like this:
The values in arr_A are either 1 or 0. For every "1" in arr_A, I need to calculate a certain value according to arr_B. For example, if arr_A[x,y,z] == 1, then C[x,y,z] = 1/(arr_B[x-1,y,z] + arr_B[x,y-1,z] + arr_B[x,y,z-1] + arr_B[x+1,y,z] + arr_B[x,y+1,z] + arr_B[x,y,z+1]). Then I use the minimum of array C as a parameter for a function. The function slightly changes arr_A and arr_B so that they evolve; then the "result" is computed again and the loop keeps going.
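
Written out as loops, the computation of C is roughly the following sketch (the helper name is made up, and boundary handling is omitted, so only interior points are computed):

import numpy as np

def compute_C_min(arr_A, arr_B):
    # For every carrier site, take the reciprocal of the sum of its
    # six nearest neighbors in arr_B; then return the global minimum.
    C = np.full(arr_A.shape, np.inf)  # inf so empty sites don't affect the minimum
    X, Y, Z = arr_A.shape
    for x in range(1, X-1):
        for y in range(1, Y-1):
            for z in range(1, Z-1):
                if arr_A[x, y, z] == 1:
                    C[x, y, z] = 1/(arr_B[x-1, y, z] + arr_B[x, y-1, z] +
                                    arr_B[x, y, z-1] + arr_B[x+1, y, z] +
                                    arr_B[x, y+1, z] + arr_B[x, y, z+1])
    return C.min()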

Notice that for every single C[x,y,z], many values in arr_B are involved. Otherwise I could simply do something like this:

C = arr_B[arr_A>0]**2

I hope the solution can be as simple as that, but I can't find any feasible indexing method other than a triple-nested 'for' loop.
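
The closest I can get with slicing is the six-neighbor sum itself over the interior points, something like the sketch below, but the rest of the evolution step does not reduce to a single array expression like this:

nbr_sum = (arr_B[:-2, 1:-1, 1:-1] + arr_B[2:, 1:-1, 1:-1] +
           arr_B[1:-1, :-2, 1:-1] + arr_B[1:-1, 2:, 1:-1] +
           arr_B[1:-1, 1:-1, :-2] + arr_B[1:-1, 1:-1, 2:])
mask = arr_A[1:-1, 1:-1, 1:-1] == 1   # carrier sites in the interior
C_min = (1/nbr_sum[mask]).min()       # minimum over carrier sites only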

After reading this and some documents about multithreading and multiprocessing, I tried using multiprocessing, but the simulation is not much faster.

I slice the arrays like this for multiprocessing. To be specific, carrier_3d and potential_3d are the arr_A and arr_B mentioned above, respectively. I put the slices into different sub-processes. The details of the functions are not given here, but you can get the main idea.

from multiprocessing import Pool
import numpy as np

# Each sub-process gets a slab of rows; all slabs except the last carry two
# extra rows of overlap so neighbor lookups across the slab edge still work.
chunk = np.shape(carrier_3d)[0] // cores
p = Pool(processes=cores)
for i in range(cores):
    slice_of_carrier_3d = slice(i*chunk,
                                np.shape(carrier_3d)[0] if i == cores-1 else (i+1)*chunk+2)
    p.apply_async(hopping_x_section, args=(i, chunk, carrier_3d[slice_of_carrier_3d, :, :],
                                           potential_3d[slice_of_carrier_3d, :, :]),
                  callback=paral_site_record)
p.close()
p.join()

In case you want to know more about the computation, the following code is basically how my computation works without multiprocessing. I have explained the process above.

import math
import random
import numpy as np

# t_ox, q, kB, T, v0, len_x, len_y, len_z, time_counter and set_time are
# physical constants and simulation parameters defined elsewhere in my script.
def vab(carrier_3d, potential_3d, a, b):
    """Hopping probability from site a to site b."""
    try:
        Ea = potential_3d[a[0], a[1], a[2]]
        Eb = potential_3d[b[0], b[1], b[2]]
        if carrier_3d[b[0], b[1], b[2]] > 0:  # target site already occupied
            return 0
        elif b[2] < t_ox:                     # target site below z = t_ox
            return 0
        elif b[0] < 0 or b[1] < 0:            # avoid negative-index wraparound
            return 0
        elif Eb > Ea:                         # hop upward in energy
            return math.exp(-10*math.sqrt((b[0]-a[0])**2 +
                                          (b[1]-a[1])**2 + (b[2]-a[2])**2) -
                            q*(Eb-Ea)/(kB*T))
        else:                                 # hop downward in energy
            return math.exp(-10*math.sqrt((b[0]-a[0])**2 +
                                          (b[1]-a[1])**2 + (b[2]-a[2])**2))
    except IndexError:                        # off the high edge of the array
        return 0
# Given a point, get v_ij for all 26 neighboring sites around the point
# (the self term is included in the loop, but vab returns 0 for an occupied site).
def v_all_drt(carrier_3d, potential_3d, x, y, z):
    x_neighbor = [-1, 0, 1]
    y_neighbor = [-1, 0, 1]
    z_neighbor = [-1, 0, 1]
    v = []     # list of hopping probabilities (v_ij) to the nearest sites
    drtn = []  # corresponding directions (target sites)
    for i in x_neighbor:
        for j in y_neighbor:
            for k in z_neighbor:
                v.append(vab(carrier_3d, potential_3d,
                             [x, y, z], [x+i, y+j, z+k]))
                drtn.append([x+i, y+j, z+k])
    return np.array(v), np.array(drtn)
def hopping():
    global sys_time
    global time_counter
    global hop_ini
    global hop_finl
    global carrier_3d
    global potential_3d
    rt_min = 1000  # 1000 has no meaning; just a large enough number to start with
    for x in range(np.shape(carrier_3d)[0]):
        for y in range(np.shape(carrier_3d)[1]):
            for z in range(t_ox, np.shape(carrier_3d)[2]):
                if carrier_3d[x, y, z] == 1:
                    v, drt = v_all_drt(carrier_3d, potential_3d, x, y, z)
                    if v.sum() > 0:
                        rt_i = -math.log(random.random())/v.sum()/v0
                        if rt_i < rt_min:
                            rt_min = rt_i
                            v_hop = v
                            drt_hop = drt
                            hop_ini = np.array([x, y, z], dtype=int)
    # The loop above finds the carrier that hops. We still need the
    # hopping direction, drawn from the cumulative distribution of
    # the hopping probabilities.
    rdm2 = random.random()
    for i in range(len(v_hop)):
        if (rdm2 > v_hop[:i].sum()/v_hop.sum()) and\
           (rdm2 <= v_hop[:i+1].sum()/v_hop.sum()):
            hop_finl = np.array(drt_hop[i], dtype=int)
            break
    carrier_3d[hop_ini[0], hop_ini[1], hop_ini[2]] = 0
    carrier_3d[hop_finl[0], hop_finl[1], hop_finl[2]] = 1
def update_carrier():
    pass
def update_potential():
    pass
#------------------------------------- 
carrier_3d = np.random.randn(len_x, len_y, len_z)
carrier_3d[carrier_3d > .5] = 1   # threshold the random values to 0/1 occupancy
carrier_3d[carrier_3d <= .5] = 0
carrier_3d = carrier_3d.astype(int)
potential_3d = np.random.randn(len_x, len_y, len_z)
while time_counter <= set_time:  # set_time sets the running time of the simulation
    hopping()
    update_carrier()
    update_potential()
    time_counter += 1

Solution

  • You can use numba to create a JIT-compiled version of your analysis function. This alone will give the biggest speedup to your code, and it tends to work very well when your problem fits within numba's constraints. You will have to write a more sophisticated analysis inside your for loop, but I don't see any reason why what you've outlined wouldn't work. See the following code, which shows a roughly 300-fold speedup (23.2 ms down to 77.5 µs) from compiling with numba. You can also direct certain numba functions to execute in parallel. However, the overhead associated with this only yields a speedup once the problem gets sufficiently large, so that is something you will have to weigh for yourself.

    import numpy as np
    from numba import njit

    def function(A, B):
        # Fill C wherever A holds a 1; all other entries stay zero.
        C = np.zeros(shape=B.shape)
        X, Y, Z = B.shape
        for x in range(X):
            for y in range(Y):
                for z in range(Z):
                    if A[x, y, z] == 1:
                        C[x, y, z] = B[x, y, z]**2
        return C

    # Compile a serial and a parallel version of the same function.
    cfunction = njit(function)
    cfunction_parallel = njit(function, parallel=True)

    X, Y, Z = 50, 50, 10
    A = np.random.randint(0, 2, size=X*Y*Z).reshape(X, Y, Z)
    B = np.random.random(size=X*Y*Z).reshape(X, Y, Z)

    _ = cfunction(A, B)  # force compilation so as not to slow down the timers
    _ = cfunction_parallel(A, B)

    print('uncompiled function')
    %timeit function(A, B)

    print('\nfor smaller computations, the parallel overhead makes it slower')
    %timeit cfunction(A, B)
    %timeit cfunction_parallel(A, B)

    X, Y, Z = 1000, 1000, 50
    A = np.random.randint(0, 2, size=X*Y*Z).reshape(X, Y, Z)
    B = np.random.random(size=X*Y*Z).reshape(X, Y, Z)

    print('\nfor larger computations, parallelization helps')
    %timeit cfunction(A, B)
    %timeit cfunction_parallel(A, B)
    

    this prints:

    uncompiled function
    23.2 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    for smaller computations, the parallel overhead makes it slower
    77.5 µs ± 1.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    121 µs ± 2.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    for larger computations, parallelization helps
    138 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    40.1 ms ± 633 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
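
    If you want explicit control over which loop is parallelized, numba also provides prange. The sketch below (an illustration of the same function, not timed here) marks the outer loop for thread-level parallelism:

    from numba import njit, prange
    import numpy as np

    @njit(parallel=True)
    def cfunction_prange(A, B):
        C = np.zeros(shape=B.shape)
        X, Y, Z = B.shape
        for x in prange(X):  # iterations of this loop are distributed across threads
            for y in range(Y):
                for z in range(Z):
                    if A[x, y, z] == 1:
                        C[x, y, z] = B[x, y, z]**2
        return C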