
Python - RGBA image non-zero pixels extracting - numpy mask speed up


I need to extract the non-zero pixels from an RGBA image. The code below works, but because I need to deal with really huge images, some speed-up would be welcome. Getting "f_mask" is the longest task. Is it possible to somehow make things work faster? How can I delete rows whose values are all zero ([0, 0, 0, 0]) faster?

import numpy as np
import time


img_size = (10000, 10000)
img = np.zeros((*img_size, 4), float)  # Make RGBA image

# Put some values for pixels [float, float, float, int]
img[0][1] = [1.1, 2.2, 3.3, 4]
img[1][0] = [0, 0, 0, 10]
img[1][2] = [6.1, 7.1, 8.1, 0]


def f_img_to_pts(f_img):  # Get non-zero rows with values from whole img array
    f_shp = f_img.shape
    f_newshape = (f_shp[0]*f_shp[1], f_shp[2])
    f_pts = np.reshape(f_img, f_newshape)
    f_mask = ~np.all(f_pts == 0, axis=1)
    f_pts = f_pts[f_mask]
    return f_pts


t1 = time.time()
pxs = f_img_to_pts(img)
t2 = time.time()
print('PIXELS EXTRACTING TIME: ', t2 - t1)
print(pxs)

Solution

  • Analysis

    ~np.all(f_pts == 0, axis=1) is sub-optimal for several reasons:

    First of all, the operation is memory-bound (typically by the RAM bandwidth) because the image is huge: about 3.2 GB in memory with 64-bit floats, and even the boolean temporaries are large (the f_pts == 0 array takes 400 MB and the final mask 100 MB). Moreover, Numpy creates multiple temporary arrays: one for the result of f_pts == 0, one for np.all(...) and even one for ~np.all(...). Each of them is fully written to RAM and read back, which is inefficient. Even worse: newly allocated arrays are typically not directly mapped in physical memory due to the way virtual memory works, so the expensive overhead of page faults has to be paid for each temporary array. Last but not least, Numpy is not optimized for computing arrays whose main dimension (i.e. generally the last one) is tiny, like here (4 items). This can be fixed by using a better memory layout though; see AoS vs SoA for more information. A minimal sketch that avoids some of these temporaries is shown at the end of this analysis.

    Note that np.zeros reserves zeroed space in virtual memory but does not physically map it. Thus, the first call to f_img_to_pts is slower because of page faults. In real-world applications this is generally not an issue, because writing zeros to img or reading it once causes the whole array to be physically mapped (i.e. the page faults are paid up front). If that is not the case, then you should certainly use sparse matrices instead of dense ones. Let's assume img is already mapped in physical memory (this can be done by running the code twice or by simply calling img.fill(0) after np.zeros).

    Also please note that the dtype float is actually the 64-bit float type, not the 32-bit one. 64-bit floats are generally more expensive, especially in memory-bound code, since they take twice as much space. You should certainly use 32-bit floats for image computations since they nearly never require such high precision. Let's assume the input array is now a 32-bit one.
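
    As a quick illustration of the temporary-array point above, here is a minimal pure-Numpy tweak (a sketch only, not the optimized solution below): f_pts.any(axis=1) builds the "non-zero row" mask in a single reduction, so the f_pts == 0 temporary and the extra ~ pass disappear. It is still memory-bound, so do not expect a large speed-up from this alone.

    import numpy as np

    def f_img_to_pts_any(f_img):
        # Same semantics as f_img_to_pts, but the mask is built with a single
        # any() reduction: non-zero values are truthy, so there is no `== 0`
        # temporary and no extra `~` pass.
        f_shp = f_img.shape
        f_pts = np.reshape(f_img, (f_shp[0] * f_shp[1], f_shp[2]))
        f_mask = f_pts.any(axis=1)  # True for rows with at least one non-zero channel
        return f_pts[f_mask]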


    Solutions

    One way to reduce the overheads is to iterate row by row (loops are not as bad as they seem here; quite the opposite, actually, when done correctly). While this should be faster, it is still sub-optimal: there is no way to make it much faster than that with plain Numpy because of the way Numpy is designed. A sketch of this row-by-row idea follows.
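
    Here is a sketch of that row-by-row (in fact row-chunked) idea; the chunk size of 256 rows is an arbitrary assumption, the point being only that the temporaries for one chunk are small enough to stay in cache:

    import numpy as np

    def f_img_to_pts_chunked(f_img, rows_per_chunk=256):
        # Process a slab of rows at a time so the comparison/mask temporaries
        # stay small, then concatenate the surviving rows at the end.
        n, m, c = f_img.shape
        parts = []
        for start in range(0, n, rows_per_chunk):
            chunk = f_img[start:start + rows_per_chunk].reshape(-1, c)
            mask = chunk.any(axis=1)  # non-zero rows of this slab
            if mask.any():
                parts.append(chunk[mask])
        if parts:
            return np.concatenate(parts)
        return np.empty((0, c), dtype=f_img.dtype)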

    To make this even faster, one can use modules that compile Python functions to native code, like Numba or Cython. With them, we can easily avoid the creation of temporary arrays, specialize the code for 4-channel images and even use multiple threads. Here is the resulting code:

    import numpy as np
    import time
    import numba as nb
    
    img_size = (10000, 10000)
    img = np.zeros((*img_size, 4), np.float32)  # Make RGBA image
    img.fill(0)  # Force the pages to be physically mapped (see the analysis above)
    
    # Put some values for pixels [float, float, float, int]
    img[0][1] = [1.1, 2.2, 3.3, 4]
    img[1][0] = [0, 0, 0, 10]
    img[1][2] = [6.1, 7.1, 8.1, 0]
    
    # Compute ~np.all(f_pts==0,axis=1) and assume most items are 0
    @nb.njit('(float32[:,:,::1],)', parallel=True)
    def f_img_to_pts_optim(f_img):  # Get non-zero rows with values from whole img array
        n, m, c = f_img.shape
        assert c == 4 # For sake of performance
        f_mask = np.zeros(n * m, dtype=np.bool_)
        f_pts = np.reshape(f_img, (n*m, c))
        for ij in nb.prange(n * m):
            f_mask[ij] = (f_pts[ij, 0] != 0) | (f_pts[ij, 1] != 0) | (f_pts[ij, 2] != 0) | (f_pts[ij, 3] != 0)
        f_pts = f_pts[f_mask]
        return f_pts
    
    t1 = time.time()
    pxs = f_img_to_pts_optim(img)
    t2 = time.time()
    print('PIXELS EXTRACTING TIME: ', t2 - t1)
    print(pxs)
    

    Results

    Here are timings on a 10-core Skylake Xeon CPU:

    Initial code (64-bit):          1720 ms
    Initial code (32-bit):          1630 ms
    Optimized code (sequential):     323 ms
    Optimized code (parallel):       182 ms   <----------
    

    The optimized code is 9 times faster than the initial code (on the same input type). Note that the code does not scale with the number of cores, since reading the input is a bottleneck, combined with the sequential f_pts[f_mask] operation. One could build f_pts[f_mask] using multiple threads (which should be about 2 times faster), but this is rather complicated to do, especially with Numba. A rough sketch of such a two-pass parallel compaction is shown below.
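
    For completeness, here is a rough sketch of what such a parallel compaction could look like (the split into 64 chunks is an arbitrary assumption): a first pass counts the non-zero rows per chunk, a prefix sum turns the counts into write offsets, and a second pass copies each chunk's rows to its own offset. The extra bookkeeping is exactly what makes this more involved than a simple boolean mask.

    import numpy as np
    import numba as nb

    @nb.njit('(float32[:,::1],)', parallel=True)
    def parallel_compact(f_pts):
        # Two-pass parallel compaction of the (n*m, 4) pixel array.
        n = f_pts.shape[0]
        nchunks = 64  # arbitrary number of independent chunks
        chunk = (n + nchunks - 1) // nchunks
        counts = np.zeros(nchunks + 1, dtype=np.int64)
        # Pass 1: count the non-zero rows of each chunk
        for ci in nb.prange(nchunks):
            start = ci * chunk
            end = min(start + chunk, n)
            cnt = 0
            for i in range(start, end):
                if f_pts[i, 0] != 0 or f_pts[i, 1] != 0 or f_pts[i, 2] != 0 or f_pts[i, 3] != 0:
                    cnt += 1
            counts[ci + 1] = cnt
        offsets = np.cumsum(counts)  # write offset of each chunk in the output
        total = offsets[-1]
        out = np.empty((total, 4), dtype=f_pts.dtype)
        # Pass 2: each chunk copies its non-zero rows starting at its offset
        for ci in nb.prange(nchunks):
            start = ci * chunk
            end = min(start + chunk, n)
            pos = offsets[ci]
            for i in range(start, end):
                if f_pts[i, 0] != 0 or f_pts[i, 1] != 0 or f_pts[i, 2] != 0 or f_pts[i, 3] != 0:
                    out[pos, :] = f_pts[i, :]
                    pos += 1
        return out

    Calling it on the reshaped view, e.g. parallel_compact(img.reshape(-1, 4)) with a C-contiguous float32 image, would replace both the mask construction and the f_pts[f_mask] step.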