I have two arrays A, B that are both of the shape (42, 28, 4) where:
42 : y_dim
28 : x_dim
4 : RGBA
## I'm on a MacBook Air M1 2020, 16 GB, btw
I want to combine these through a similar process to this:
def add(A, B):
    X = A.shape[1]
    Y = A.shape[0]
    alpha = A[..., 3] / 255
    B[..., :3] = blend(B[..., :3], A[..., :3], alpha.reshape(Y, X, 1))
    return B

def blend(c1, c2, alpha):
    return np.asarray((c1 + np.multiply(c2, alpha)) / (np.ones(alpha.shape) + alpha), dtype='uint8')
But currently this is a bit too slow for my liking (~20 ms with 250 images overlaid on top of a base array [1]). If you have any ways to improve this (preferably with 8-bit alpha support), I'd be happy to know.
[1]:
start = time.time()
for obj in l:  # len(l) == 250
    _slice = np.index_exp[obj.y * 42:(obj.y + 1) * 42, obj.x * 28:(obj.x + 1) * 28, :]
    self.pixels[_slice] = add(obj.array, self.pixels[_slice])
stop = time.time()
>>> stop - start  # ~20 ms
I've semi-tried the following:
# cv2.addWeighted() in add()
## doesn't work because it has one alpha for the whole image,
## but I want to have individual alpha control for each pixel
B = cv.addWeighted(A, 0.5, B, 0.5, 0)
# np.vectorize blend() and use in add()
## way too slow because, as the docs mention, it's basically just a for-loop
B[..., :3] = np.vectorize(blend)(A[..., :3], B[..., :3], A[..., 3] / 255)
# changed blend() to the following
def blend(a, b, alpha):
    if alpha == 0:
        return b
    elif alpha == 1:
        return a
    return (b + a * alpha) / (1 + alpha)
# moved the blend()-stuff to add()
## doesn't combine properly; too dark with alpha
np.multiply(A, alpha.reshape(Y, X, 1)) + np.multiply(B, 1 - alpha.reshape(Y, X, 1))
I've also tried some bitwise stuff, but my monkey brain can't comprehend it properly. I am on an M1 Mac, so if you have any experience with metalcompute and Python, please include any thoughts about that!
Any input is welcomed, thanks in advance!
Answer: Christoph Rackwitz posted a very elaborate and well-constructed answer, so if you are curious about similar things, check out the accepted answer below.
To add to this I ran Christoph's code on my M1 computer to show the results.
2500 calls (numpy) = 0.0807
2500 calls (other) = 0.0833
2500 calls (Christoph's) = 0.0037
First, your blending equation looks wrong. Even with alpha being 255, you'd only get a 50:50 mix. You'd want something like B = B * (1 - alpha) + A * alpha, or rearranged, B += (A - B) * alpha, but that expression has teeth (the integer subtraction will overflow/underflow).
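A minimal numpy sketch of that corrected formula (the function name `blend_over` is my own); casting to float32 before subtracting sidesteps the uint8 overflow/underflow mentioned above:

```python
import numpy as np

def blend_over(A, B):
    """Alpha-composite sprite A over background B; both uint8 with shape (H, W, 4)."""
    a = A[..., :3].astype(np.float32)
    b = B[..., :3].astype(np.float32)
    alpha = A[..., 3:4].astype(np.float32) / 255.0  # keep a trailing axis for broadcasting
    out = B.copy()
    # B + (A - B) * alpha  ==  A * alpha + B * (1 - alpha)
    out[..., :3] = (b + (a - b) * alpha).astype(np.uint8)
    return out

# a fully opaque sprite should replace the background entirely
A = np.zeros((2, 2, 4), dtype=np.uint8)
A[..., :3] = 200
A[..., 3] = 255
B = np.full((2, 2, 4), 50, dtype=np.uint8)
print(blend_over(A, B)[0, 0, :3])  # -> [200 200 200]
```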
You appear to be drawing "sprites" in a grid on the display of a game. Just use 2D graphics libraries for this, or even 3D (OpenGL?). GPUs are very good at drawing textured quads with transparency. Even without involvement of a GPU, the right library will contain optimized primitives and you don't have to write any of that yourself.
The cost of uploading textures (to GPU memory) is a one-time cost, assuming the sprites don't change appearance. If they change for every frame, that may be noticeable.
Since I originally proposed to use numba, and previous answers have only gotten a factor of 2 and then 10 out of it, I'll show a few more points to be aware of.
A previous answer proposes a function with this in its inner loop:
B[i, j, :3] = (B[i, j, :3] + A[i, j, :3] * alpha[i, j]) / (1 + alpha[i, j])
Seems sensible because it treats the entire pixel at once but this still uses numpy, which is comparatively slow because it's written to be generic (various dtypes and shapes). Numba has no such requirement. It'll happily generate the specific machine code for this specific situation (uint8, fixed number of dimensions, fixed iterations of inner loop).
If you unroll this one more time, removing the numpy calls from the inner loop, you'll get the true speed of numba:
for k in range(3):
    B[i, j, k] = (B[i, j, k] + A[i, j, k] * alpha[i, j]) / (1 + alpha[i, j])
Timing results (relative speed important, my computer is old):
2500 calls (numpy) = 0.3845
2500 calls (other) = 0.5039
2500 calls (mine) = 0.0901
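For context, a sketch of the full kernel that unrolled loop lives in (function name is my own; it keeps the question's original formula, which the top of this answer points out is wrong). The try/except fallback only exists so the sketch runs without numba installed:

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback: run as plain (slow) Python if numba is absent
    def njit(func):
        return func

@njit
def add_numba(A, B):
    # Specialized for uint8 arrays of shape (H, W, 4); numba compiles
    # these loops to machine code for exactly this dtype and layout.
    H, W = A.shape[0], A.shape[1]
    alpha = A[:, :, 3].astype(np.float32) / np.float32(255)
    for i in range(H):
        for j in range(W):
            for k in range(3):
                B[i, j, k] = np.uint8(
                    (B[i, j, k] + A[i, j, k] * alpha[i, j]) / (1 + alpha[i, j]))
    return B
```

With alpha = 255 this produces the 50:50 mix criticized above: blending 200 over 100 gives (100 + 200) / 2 = 150.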
You can keep going, pulling constants out of each loop, in case LLVM (what numba uses) doesn't notice the optimization.
Cache locality plays a huge role as well. Instead of calculating an entire alpha array once, just calculate the per-pixel alpha in the second-innermost loop:
You should look at the dtype of alpha = .../255, which is float64. Use float32 instead of float64 because that's usually faster.
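A quick standalone check (not part of the kernel) confirming the promotion: dividing a uint8 array by a Python int promotes to float64, while an explicit cast keeps everything in float32:

```python
import numpy as np

A = np.zeros((42, 28, 4), dtype=np.uint8)
print((A[..., 3] / 255).dtype)                                  # float64
print((A[..., 3].astype(np.float32) / np.float32(255)).dtype)   # float32
```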
alpha = A[i, j, 3] * np.float32(1/255)
for k in range(3):
    B[i, j, k] = (B[i, j, k] + A[i, j, k] * alpha) / (1 + alpha)
And now let's do integer arithmetic, which is even faster than floating point, and with correct blending:
alphai = np.int32(A[i, j, 3])  # uint8 needs to be widened to avoid overflow/underflow
for k in range(3):
    a = A[i, j, k]
    b = B[i, j, k]
    c = (a * alphai + b * (255 - alphai) + 128) >> 8  # fixed-point arithmetic, may be off by 1
    B[i, j, k] = c
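Wrapped into a complete function (name `blend_int` is my own), the integer version looks like this; again the try/except is only a fallback so the sketch runs without numba:

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback: run as plain Python if numba is absent
    def njit(func):
        return func

@njit
def blend_int(A, B):
    # In-place integer "over" blend of sprite A onto background B, both uint8 (H, W, 4).
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            alphai = np.int32(A[i, j, 3])  # widen uint8 to avoid overflow/underflow
            inv = np.int32(255) - alphai
            for k in range(3):
                # >> 8 divides by 256 instead of 255, so results may be off by 1
                B[i, j, k] = np.uint8(
                    (A[i, j, k] * alphai + B[i, j, k] * inv + 128) >> 8)
    return B
```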
Finally:
2500 calls (numpy) = 0.3904
2500 calls (other) = 0.5211
2500 calls (mine) = 0.0118
So that's a speedup of 33. And that's on my old computer that doesn't have any of the latest vector instruction sets.
And instead of calling such a function 250 times, you could call it one time with all the data. That may open up the possibility of parallelism. Numba lets you do it, but it's not trivial...
Since your game display is a grid, you can collect all the sprites for each grid cell (ordered of course). Then you can render each cell in parallel.
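A sketch of that batched idea, under assumptions of my own: all sprites stacked into one (N, 42, 28, 4) array with per-sprite grid coordinates, parallelized with numba's prange. Note the caveat in the comments: this is only race-free if no two sprites in the batch share a grid cell, which is exactly why the answer suggests grouping sprites per cell first. The try/except is a fallback so the sketch runs without numba:

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fallback: run as plain Python if numba is absent
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda func: func
    prange = range

@njit(parallel=True)
def draw_all(pixels, sprites, ys, xs):
    # pixels: the full framebuffer (uint8); sprites: (N, 42, 28, 4) uint8;
    # ys/xs: grid cell of each sprite. prange is only safe if no two
    # sprites write to the same cell (otherwise the iterations race).
    for n in prange(sprites.shape[0]):
        y0 = ys[n] * 42
        x0 = xs[n] * 28
        for i in range(42):
            for j in range(28):
                alphai = np.int32(sprites[n, i, j, 3])
                inv = np.int32(255) - alphai
                for k in range(3):
                    pixels[y0 + i, x0 + j, k] = np.uint8(
                        (sprites[n, i, j, k] * alphai
                         + pixels[y0 + i, x0 + j, k] * inv + 128) >> 8)
```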