Cython optimization slow

I am trying to optimize the following python code with cython:

cimport numpy
from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def cython_color2gray(numpy.ndarray[numpy.uint8_t, ndim=3] image):
    cdef int x,y,z
    cdef double z_val, grey
    for x in range(len(image)):
        for y in range(len(image[x])):
            grey = 0
            for z in range(len(image[x][y])):
                if z == 0:
                    z_val = image[x][y][0] * 0.21
                    grey += z_val
                elif z == 1:
                    z_val = image[x][y][1] * 0.07
                    grey += z_val
                elif z == 2:
                    z_val = image[x][y][2] * 0.72
                    grey += z_val
            image[x][y][0] = grey
            image[x][y][1] = grey
            image[x][y][2] = grey
    return image

However, when I check whether everything is as optimized as it should be, the annotated output still shows yellow lines (see picture). Is there anything else I can do to optimize this Cython code and make it run faster?

[Image: annotated Cython output]

Solution

Here are some key points:

The len() function is a Python function and has measurable overhead. Since image is an np.ndarray anyway, prefer the .shape attribute to get the number of elements in each dimension.
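Use image[i, j, k] instead of image[i][j][k] for element access. Chained indexing creates an intermediate array object at every step, whereas a single multi-index lets Cython read the element directly from the buffer.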
Prefer typed memoryviews, since the syntax is cleaner and they are faster. For instance, the memoryview equivalent of numpy.ndarray[T, ndim=3] is T[:, :, :], where T denotes the type of the data elements. If you know that your array's memory layout is C-contiguous, you can specify the layout with T[:, :, ::1]. In C, unsigned char is the smallest unsigned integer type and is 8 bits wide on most modern platforms, so it is equivalent to numpy.uint8_t. Therefore, your numpy.ndarray[numpy.uint8_t, ndim=3] image becomes unsigned char[:, :, ::1] image, given that image's data is C-contiguous. Alternatively, you could use uint8_t[:, :, ::1] after cimporting the C type uint8_t from libc.stdint.
The variable grey is a double, while the elements of image are np.uint8 (equivalent to unsigned char). When you write image[i, j, k] = grey in Python, grey is cast to an unsigned char, i.e. the fractional part is cut off. In Cython, you have to do the cast manually.
Once you know your code works as expected, you can accelerate it further with directives for the Cython compiler, e.g. disabling bounds checking and negative-index handling (wraparound). Note that these are decorators that need to be cimported from cython.
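The truncation mentioned in the key points can be reproduced in plain Python, which also gives a slow but easy-to-read reference for a single pixel. This is only a sketch: gray_value is a made-up helper name, the weights are the ones from the question, and it assumes the weighted sum stays within 0..255 so that int() behaves like the <unsigned char> cast.

```python
def gray_value(c0, c1, c2):
    """Weighted sum of one pixel's three channels, truncated like a C cast.

    gray_value is a hypothetical helper used only for illustration.
    """
    grey = c0 * 0.21 + c1 * 0.07 + c2 * 0.72  # double-precision float
    # int() drops the fractional part, as <unsigned char> does for values in 0..255
    return int(grey)


print(gray_value(100, 50, 200))  # weighted sum is 168.5, stored value is 168
```

If you would rather round than truncate, that has to be written explicitly in both the Python and the Cython version.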

And your code snippet becomes:

from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def cython_color2gray(unsigned char[:, :, ::1] image):
    cdef int x,y,z
    cdef double z_val, grey
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            grey = 0
            for z in range(image.shape[2]):
                if z == 0:
                    z_val = image[x, y, 0] * 0.21
                    grey += z_val
                elif z == 1:
                    z_val = image[x, y, 1] * 0.07
                    grey += z_val
                elif z == 2:
                    z_val = image[x, y, 2] * 0.72
                    grey += z_val
            image[x, y, :] = <unsigned char> grey
    return image

Looking closely, you'll see that there's no need for the innermost loop:

from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def cython_color2gray(unsigned char[:, :, ::1] image):
    cdef int x, y
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            image[x, y, :] = <unsigned char>(image[x, y, 0] * 0.21 + image[x, y, 1] * 0.07 + image[x, y, 2] * 0.72)
    return image
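To check the result and to have a baseline to time against, a NumPy-only version can be handy. The following is a minimal sketch, not part of the code above: numpy_color2gray is a hypothetical name, and it assumes an (H, W, 3) array of dtype uint8 that is modified in place, like the Cython function.

```python
import numpy as np


def numpy_color2gray(image):
    """In-place grayscale conversion of an (H, W, 3) uint8 array (hypothetical helper)."""
    # uint8 channel * Python float gives float64, like the double arithmetic in Cython
    grey = image[:, :, 0] * 0.21 + image[:, :, 1] * 0.07 + image[:, :, 2] * 0.72
    # astype(np.uint8) drops the fractional part, like the <unsigned char> cast
    image[:, :, :] = grey.astype(np.uint8)[:, :, np.newaxis]
    return image


image = np.array([[[100, 50, 200], [0, 0, 0]]], dtype=np.uint8)
expected = numpy_color2gray(image.copy())
# Once the Cython module is compiled, the two results could be compared with
# np.array_equal(cython_color2gray(image.copy()), expected)
```

With aggressive compiler flags the C code may fuse multiplications and additions, so individual pixels whose weighted sum lies extremely close to an integer could differ by one between the two versions.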

Going one step further, you can try to speed up the C code that Cython generates by enabling your C compiler's auto-vectorization (SIMD). For gcc/clang, use the flags -O3 and -march=native. For MSVC, use /O2 and /arch:AVX2 (assuming your machine supports AVX2). If you're working inside a Jupyter notebook, you can pass C compiler flags to the Cython magic via the -c=YOURFLAG argument, e.g.

%%cython -a -f -c=-O3 -c=-march=native
# your cython code here..
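Outside a notebook, the same flags can be passed through a build script. The sketch below is one possible setup.py, under several assumptions that are not in the code above: the module and file names (color2gray, color2gray.pyx) are invented, and Windows is assumed to mean MSVC while every other platform is assumed to use gcc or clang.

```python
# setup.py (sketch)
import sys


def vectorization_flags(platform=sys.platform):
    """Compiler flags from the answer; assumes MSVC on Windows, gcc/clang elsewhere."""
    if platform == "win32":
        return ["/O2", "/arch:AVX2"]  # requires a CPU that supports AVX2
    return ["-O3", "-march=native"]


def build():
    # Imported here so the flag helper above works without the build tools installed.
    from setuptools import Extension, setup
    from Cython.Build import cythonize

    extension = Extension(
        name="color2gray",            # hypothetical module name
        sources=["color2gray.pyx"],   # hypothetical file containing the function
        extra_compile_args=vectorization_flags(),
    )
    # annotate=True also writes the HTML report with the yellow lines
    setup(ext_modules=cythonize([extension], annotate=True))
```

Ending the file with a call to build() and running python setup.py build_ext --inplace would then compile the module. Since the memoryview version no longer cimports numpy, no NumPy include directories should be needed.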