Numpy ravel takes too long after a slight change to a ndarray

I am working with a flatten image (1920x1080x4), in which I need to reshape (e.g. arr.reshape((1920,1080,4))), remove the last channel (e.g. arr[:,:,:3]), convert from BGR to RGB (e.g. arr[:,:,::-1]) and finally flatten again (e.g. arr.ravel()). The problem is with the ravel/flatten/reshape(-1) operation, that increases about 20ms of the computation time.

To make it easy to debug, I assumed the incoming array as a flatten 1920x1080x3 image, meaning that I only need to worry about the BGR to RGB conversion and flattening. However, when testing reshape+ravel, reshape+BGR2RGB and, finally, reshape+BGR2RGB+ravel, the results were 1ms, 1ms, 20ms respectively, which doesn't make any sense to me, since it's only some values changing position in memory. Is there any reason for the ravel to create a copy of the array? How can i reduce this time?

Note: I also tested the inplace reshape method that's written on the notes of numpy.reshape documentation, but, as specified, an error was raised, meaning that the array need to be copied before in order to reshape.

Bellow is the code I used for testing:

import numpy as np
from time import time

arr_original = np.ones((1920*1080*3), dtype=np.uint8)

arr = arr_original.copy()
s = time()
arr = arr.reshape(1920,1080,3)
arr = arr.ravel()
print(f"Reshape + ravel: {round(1000*(time()-s),2)}ms")

arr = arr_original.copy()
s = time()
arr = arr.reshape(1920,1080,3)
arr = arr[:,:,::-1]
print(f"Reshape + BGR2RGB: {round(1000*(time()-s),2)}ms")

arr = arr_original.copy()
s = time()
arr = arr.reshape(1920,1080,3)
arr = arr[:,:,::-1]
arr = arr.ravel()
print(f"Reshape + BGR2RGB + ravel: {round(1000*(time()-s),2)}ms")

Output on my machine

Reshape + ravel: 0.01ms
Reshape + BGR2RGB: 0.01ms
Reshape + BGR2RGB + ravel: 20.54ms

Solution

This is because all your operations above are producing views for the same data, but the last ravel is requires a copy.

An array in numpy array has an underlying memory, and shape & strides determining where each element lies.

Reshaping a contiguous array may be performed by simply changing shape and strides, without modifying the data. The same is true with slices here. But since your last array is not contiguous, when you use ravel it will make a copy of everything.

For instance in a 3d array accessing the element arr[i,j,k] means to access the memory at base + i * arr.strides[0] + j * arr.strides[1] + k * arr.strides[1] you can doo many things with this (even broadcasting if you use stride 0 in a given axis).

arr_original = np.ones((1920*1080*4), dtype=np.uint8)
arr = arr_original
print(arr.shape, arr.strides)
arr = arr.reshape(1920,1080,4)
print(arr.shape, arr.strides)
arr = arr[:,:,:3] # keep strides only reduces the length of the last axis
print(arr.shape, arr.strides)
arr = arr[:,:,::-1] # change strides of last axis to -1
print(arr.shape, arr.strides)
arr[0,0,:] = [3,4,5] # operations here are using the memory allocated
arr[0,1,:] = [6,7,8] # for arr_original
arr = arr.ravel()
arr[:] = 0 # this won't affect the original because the data was copied
print(arr_original[:8])

Improving your solution

This is the situation where you have to experiment or dive in the library code. I prefer to test different ways of writing the code.

The original approach is the best approach in general, but in this specific case what we have is unaligned memory since you are writing to a uint8 with stride 3.

When judging performance it is important to be aware of what is reasonable to expect, in this case we can compare the format conversion with a pure copy

arr = arr_original.copy()

1.89 ms ± 43.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

arr = arr_original
arr = arr.reshape(1920,1080,4)
arr = arr[:,:,:3] 
arr = arr[:,:,::-1]
arr[0,0,:] = [3,4,5] 
arr[0,1,:] = [6,7,8] 
arr = arr.ravel()

12.3 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) (About 6x times slower than a copy)

arr = arr_original
arr = arr.reshape(1920,1080,4)
arr_aux = np.empty(arr.shape[:-1] + (3,), dtype=np.uint8)
arr_aux[:,:,0] = arr[:,:,2]
arr_aux[:,:,1] = arr[:,:,1]
arr_aux[:,:,2] = arr[:,:,0]
arr = arr_aux.ravel()

4.16 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) (about 2x slower than a copy)

Analysis

In the first case the last axis has very small dimension as well, so maybe this is leading to a small loop. Let's see how this operation can be projected to C++

for(int i = 0; i < height; ++i){
  for(int j = 0; j < width; ++j){
    // this part would be the bottleneck
    for(int k = 0; k < 3; ++k){
      dst[(width * i + j)*3 + k] = src[(width * i + j)*4 + k];
    }
  }
}

Of course numpy is doing more things than this, and the indexes may be computed more efficiently by moving precomputing the part independent of the loop variable outside the loop. The idea here is to be didactic.

Let's count the number of branches executed, each for loop will perform N+1 branches for N iteration (N that enter the loop and the last jump that breaks it). So the above code runs 1 + height * (1 + 1 + width * (1 + 3)) ~ 4 * width * height branches.

If we unroll the innermost loop as

for(int i = 0; i < height; ++i){
  for(int j = 0; j < width; ++j){
    // this part would be the bottleneck
    dst[(width * i + j)*3 + 0] = src[(width * i + j)*4 + 0];
    dst[(width * i + j)*3 + 1] = src[(width * i + j)*4 + 1];
    dst[(width * i + j)*3 + 2] = src[(width * i + j)*4 + 2];
  }
}

The number of branches becomes 1 + height * (1 + 1 + width) ~ height * width, 4 times less. We cannot do this in python because we don't have access to the inner loop. But with the second code we implement something like

for(int i = 0; i < height; ++i){
  for(int j = 0; j < width; ++j){
    // this part would be the bottleneck
    dst[(width * i + j)*3 + 0] = src[(width * i + j)*4 + 0];
  }
}

for(int i = 0; i < height; ++i){
  for(int j = 0; j < width; ++j){
    dst[(width * i + j)*3 + 1] = src[(width * i + j)*4 + 1];
  }
}

for(int i = 0; i < height; ++i){
  for(int j = 0; j < width; ++j){
    dst[(width * i + j)*3 + 2] = src[(width * i + j)*4 + 2];
  }
}

That would still have less branches than the first.

By the improvement observed I imagine the last loop must be calling a function like memcpy or something else with more overhead in an attempt to be faster for bigger slices, maybe checking memory alignment and that will fail since we are using bytes with stride 3.