Tags: python, numpy, performance, loops, partitioning

Bad performance with partitioning of NumPy arrays


I am new to NumPy arrays and running into a performance issue: processing 3M rows takes around 8 minutes. I am wondering whether the partitioning of the NumPy arrays shown below is the best way to process the results of the array.

   import re, math, time
   import numpy as np
   from tqdm import tqdm

   hdf5_array=np.random.rand(3000000, 3, 4, 8, 1, 1, 1, 2)
   ndarray = np.squeeze(hdf5_array)
   print (hdf5_array.shape, ndarray.shape)
   num_elm = ndarray.shape[0]
   num_iter = ndarray.shape[2]
   num_int_points = ndarray.shape[3]
   res_array = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
   for i, row in enumerate(tqdm(ndarray)):
       for xyz in range(3):
           xyz_array = np.squeeze(np.take(row,[xyz],axis=0),axis=0)
           for iter in range(num_iter):
               iter_row = np.squeeze(np.take(xyz_array,[iter],axis=0), axis=0)
               mean_list = np.mean(iter_row, axis=0)
   print (type(res_array), res_array.ndim, res_array.dtype, res_array.shape)

Finally, a mean value of the results should be created and saved into a new array. Maybe the nested loops are also part of the problem, but I assume they cannot be avoided?

Maybe someone has a good hint about what direction I should go in to improve the performance?


Solution

  • The nested loops are certainly killing your performance.

    We can directly perform this computation with:

    %%time
    
    res_array_direct = np.swapaxes(np.swapaxes(np.mean(ndarray, axis=3), 0, 1), 0, 2)
    

    with timing

    CPU times: total: 6.86 s
    Wall time: 6.84 s
    

    This is incredibly fast compared to the nested loops because it takes full advantage of NumPy's compiled C implementation. Once you introduce nested loops, the iteration and intermediate operations run in the Python interpreter, which is far less efficient.
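    The two chained `swapaxes` calls can also be written as a single axis permutation with `transpose`, which some readers may find easier to follow. A minimal sketch verifying the equivalence on a reduced row count (1000 instead of 3M, so it runs quickly):

    ```python
    import numpy as np

    # Small stand-in for the question's (3000000, 3, 4, 8, 2) squeezed array
    ndarray = np.random.rand(1000, 3, 4, 8, 2)

    # The answer's version: mean over the integration-point axis, then two swaps
    res_swap = np.swapaxes(np.swapaxes(np.mean(ndarray, axis=3), 0, 1), 0, 2)

    # Equivalent single permutation: np.mean(..., axis=3) yields
    # (elm, xyz, iter, 2); reordering axes as (2, 0, 1, 3) gives (iter, elm, xyz, 2)
    res_perm = np.mean(ndarray, axis=3).transpose(2, 0, 1, 3)

    print(res_swap.shape)                   # (4, 1000, 3, 2)
    print(np.allclose(res_swap, res_perm))  # True
    ```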

    Summarizing the timing:

    Direct : 6.48 s
    1 Loop : 39.9 s
    2 Loops: 124 s = 2 min 4 s
    3 Loops: 473 s = 7 min 53 s
    
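    The figures above come from the `%%time` Jupyter cell magic; outside a notebook, the same measurement can be sketched with `time.perf_counter` (row count reduced here so the sketch finishes quickly; the absolute numbers will differ from the table):

    ```python
    import time
    import numpy as np

    # Reduced from 3M to 100k rows so this sketch runs in well under a second
    ndarray = np.random.rand(100000, 3, 4, 8, 2)

    start = time.perf_counter()
    res_direct = np.swapaxes(np.swapaxes(np.mean(ndarray, axis=3), 0, 1), 0, 2)
    elapsed = time.perf_counter() - start

    print(f"direct: {elapsed:.3f} s, shape {res_direct.shape}")
    ```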

    Details below:

    We can see the progressive effect of the loops. Let's add one loop back in:

    %%time
    
    res_array_1 = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
    for i, row in enumerate(tqdm(ndarray)):
        res_array_1[:, i, :, :] = np.swapaxes(np.mean(row, axis=2), 0, 1)
    
    print(np.allclose(res_array_direct, res_array_1))
    

    This single manual loop, versus the full vectorization, takes us from ~7 s to ~40 s:

    100%|██████████| 3000000/3000000 [00:38<00:00, 77730.88it/s]
    True
    CPU times: total: 39.9 s
    Wall time: 39.6 s
    

    With the second manual loop we have:

    %%time
    
    res_array_2 = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
    for i, row in enumerate(tqdm(ndarray)):
        for xyz in range(3):
            xyz_array = np.squeeze(np.take(row,[xyz],axis=0),axis=0)
            res_array_2[:, i, xyz, :] = np.mean(xyz_array, axis=1)
    
    print(np.allclose(res_array_direct, res_array_2))
    

    and output

    100%|██████████| 3000000/3000000 [02:03<00:00, 24387.97it/s]
    True
    CPU times: total: 2min 4s
    Wall time: 2min 4s
    

    Up to 2 minutes! Finally, with all three loops, as in your original code, we get:

    %%time
    
    res_array_3 = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
    for i, row in enumerate(tqdm(ndarray)):
        for xyz in range(3):
            xyz_array = np.squeeze(np.take(row,[xyz],axis=0),axis=0)
            for iter in range(num_iter):
                iter_row = np.squeeze(np.take(xyz_array,[iter],axis=0), axis=0)
                mean_list = np.mean(iter_row, axis=0)
                res_array_3[iter, i, xyz, :] = mean_list
    
    print(np.allclose(res_array_direct, res_array_3))
    

    and output

    100%|██████████| 3000000/3000000 [07:52<00:00, 6348.42it/s]
    True
    CPU times: total: 7min 57s
    Wall time: 7min 53s
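    One caveat worth noting (my observation, not part of the timings above): the looped versions store into a preallocated `float32` array, while `np.mean` on a `float64` input returns `float64`, so `allclose` passes but the dtypes differ. A sketch of casting the vectorized result to match the question's `float32` allocation, roughly halving memory:

    ```python
    import numpy as np

    # Small stand-in for the question's array
    ndarray = np.random.rand(1000, 3, 4, 8, 2)

    res_array_direct = np.swapaxes(np.swapaxes(np.mean(ndarray, axis=3), 0, 1), 0, 2)
    print(res_array_direct.dtype)  # float64, the dtype of the random input

    # Cast to match the float32 buffers allocated in the question
    res_array_f32 = res_array_direct.astype(np.float32)
    print(res_array_f32.dtype)     # float32
    ```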