Tags: python, numpy, performance, loops, partitioning

Bad performance with partitioning of NumPy arrays


I am new to NumPy arrays and running into a performance issue: processing 3M rows takes around 8 minutes. I am wondering whether the partitioning of the NumPy arrays shown below is the best way to process the results of the array.

   import re, math, time
   import numpy as np
   from tqdm import tqdm

   hdf5_array=np.random.rand(3000000, 3, 4, 8, 1, 1, 1, 2)
   ndarray = np.squeeze(hdf5_array)
   print (hdf5_array.shape, ndarray.shape)
   num_elm = ndarray.shape[0]
   num_iter = ndarray.shape[2]
   num_int_points = ndarray.shape[3]
   res_array = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
   for i, row in enumerate(tqdm(ndarray)):
       for xyz in range(3):
           xyz_array = np.squeeze(np.take(row,[xyz],axis=0),axis=0)
           for iter in range(num_iter):
               iter_row = np.squeeze(np.take(xyz_array,[iter],axis=0), axis=0)
               mean_list = np.mean(iter_row, axis=0)
   print (type(res_array), res_array.ndim, res_array.dtype, res_array.shape)

Finally, a mean value of the results should be created and saved into a new array. Maybe the nested loops are also part of the problem, but I assume they cannot be avoided?

Maybe someone has a good hint about what direction I should go in to improve the performance?


Solution

  • The nested loops are certainly killing your performance.

    We can directly perform this computation with:

    %%time
    
    res_array_direct = np.swapaxes(np.swapaxes(np.mean(ndarray, axis=3), 0, 1), 0, 2)
    

    with timing

    CPU times: total: 6.86 s
    Wall time: 6.84 s
    

    This is incredibly fast compared to the nested loops because it takes full advantage of NumPy's compiled C implementation. Once you introduce nested loops, the iteration and intermediate operations run in the Python interpreter, which is far less efficient.
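    The two chained `swapaxes` calls can also be written as a single axis permutation with `transpose`, which some readers may find easier to follow. A minimal sketch verifying the equivalence on a reduced row count (1000 instead of 3M, so it runs quickly):

    ```python
    import numpy as np

    # Small stand-in for the question's (3000000, 3, 4, 8, 2) squeezed array
    ndarray = np.random.rand(1000, 3, 4, 8, 2)

    # The answer's version: mean over the integration-point axis, then two swaps
    res_swap = np.swapaxes(np.swapaxes(np.mean(ndarray, axis=3), 0, 1), 0, 2)

    # Equivalent single permutation: np.mean(..., axis=3) yields
    # (elm, xyz, iter, 2); reordering axes as (2, 0, 1, 3) gives (iter, elm, xyz, 2)
    res_perm = np.mean(ndarray, axis=3).transpose(2, 0, 1, 3)

    print(res_swap.shape)                   # (4, 1000, 3, 2)
    print(np.allclose(res_swap, res_perm))  # True
    ```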

    Summarizing the timing:

    Direct : 6.48 s
    1 Loop : 39.9 s
    2 Loops: 124 s = 2 min 4 s
    3 Loops: 473 s = 7 min 53 s
    
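    The figures above come from the `%%time` Jupyter cell magic; outside a notebook, the same measurement can be sketched with `time.perf_counter` (row count reduced here so the sketch finishes quickly; the absolute numbers will differ from the table):

    ```python
    import time
    import numpy as np

    # Reduced from 3M to 100k rows so this sketch runs in well under a second
    ndarray = np.random.rand(100000, 3, 4, 8, 2)

    start = time.perf_counter()
    res_direct = np.swapaxes(np.swapaxes(np.mean(ndarray, axis=3), 0, 1), 0, 2)
    elapsed = time.perf_counter() - start

    print(f"direct: {elapsed:.3f} s, shape {res_direct.shape}")
    ```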

    Details below:

    We can see the progressive effect of the loops. Let's add one loop back in:

    %%time
    
    res_array_1 = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
    for i, row in enumerate(tqdm(ndarray)):
        res_array_1[:, i, :, :] = np.swapaxes(np.mean(row, axis=2), 0, 1)
    
    print(np.allclose(res_array_direct, res_array_1))
    

    This single manual loop, versus the full vectorization, takes us from ~7 s to ~40 s:

    100%|██████████| 3000000/3000000 [00:38<00:00, 77730.88it/s]
    True
    CPU times: total: 39.9 s
    Wall time: 39.6 s
    

    With the second manual loop we have:

    %%time
    
    res_array_2 = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
    for i, row in enumerate(tqdm(ndarray)):
        for xyz in range(3):
            xyz_array = np.squeeze(np.take(row,[xyz],axis=0),axis=0)
            res_array_2[:, i, xyz, :] = np.mean(xyz_array, axis=1)
    
    print(np.allclose(res_array_direct, res_array_2))
    

    and output

    100%|██████████| 3000000/3000000 [02:03<00:00, 24387.97it/s]
    True
    CPU times: total: 2min 4s
    Wall time: 2min 4s
    

    Up to 2 minutes! Finally, with all three loops, as in your original code, we get:

    %%time
    
    res_array_3 = np.zeros([num_iter, num_elm, 3, 2], dtype=np.float32)
    for i, row in enumerate(tqdm(ndarray)):
        for xyz in range(3):
            xyz_array = np.squeeze(np.take(row,[xyz],axis=0),axis=0)
            for iter in range(num_iter):
                iter_row = np.squeeze(np.take(xyz_array,[iter],axis=0), axis=0)
                mean_list = np.mean(iter_row, axis=0)
                res_array_3[iter, i, xyz, :] = mean_list
    
    print(np.allclose(res_array_direct, res_array_3))
    

    and output

    100%|██████████| 3000000/3000000 [07:52<00:00, 6348.42it/s]
    True
    CPU times: total: 7min 57s
    Wall time: 7min 53s
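    One caveat worth noting (my observation, not part of the timings above): the looped versions store into a preallocated `float32` array, while `np.mean` on a `float64` input returns `float64`, so `allclose` passes but the dtypes differ. A sketch of casting the vectorized result to match the question's `float32` allocation, roughly halving memory:

    ```python
    import numpy as np

    # Small stand-in for the question's array
    ndarray = np.random.rand(1000, 3, 4, 8, 2)

    res_array_direct = np.swapaxes(np.swapaxes(np.mean(ndarray, axis=3), 0, 1), 0, 2)
    print(res_array_direct.dtype)  # float64, the dtype of the random input

    # Cast to match the float32 buffers allocated in the question
    res_array_f32 = res_array_direct.astype(np.float32)
    print(res_array_f32.dtype)     # float32
    ```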