I am trying to optimize the following Python code with Cython:
cimport numpy
from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def cython_color2gray(numpy.ndarray[numpy.uint8_t, ndim=3] image):
    cdef int x, y, z
    cdef double z_val, grey
    for x in range(len(image)):
        for y in range(len(image[x])):
            grey = 0
            for z in range(len(image[x][y])):
                if z == 0:
                    z_val = image[x][y][0] * 0.21
                    grey += z_val
                elif z == 1:
                    z_val = image[x][y][1] * 0.07
                    grey += z_val
                elif z == 2:
                    z_val = image[x][y][2] * 0.72
                    grey += z_val
            image[x][y][0] = grey
            image[x][y][1] = grey
            image[x][y][2] = grey
    return image
However, when I check whether everything is as optimized as it should be, I still get the following yellow lines (see picture). Is there anything else I can do to optimize this Cython code and make it run faster?
Here are some key points:
The len() function is a Python function and has measurable overhead. Since image is an np.ndarray anyway, prefer the .shape attribute to get the number of elements in each dimension.
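Consider using image[i, j, k] instead of image[i][j][k] for element access. Each chained [] creates an intermediate Python object, whereas a single multi-index access on a typed buffer compiles to direct C array access.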
Prefer typed memoryviews, since the syntax is cleaner and they are faster. For instance, the equivalent memoryview of numpy.ndarray[T, ndim=3] is T[:, :, :], where T denotes the type of the data elements. If you know that your array's memory layout is C-contiguous, you can specify the layout by using T[:, :, ::1]. In C, unsigned char is an unsigned integer type that is 8 bits wide on virtually all modern platforms and is thus equivalent to numpy.uint8_t. Therefore, your numpy.ndarray[numpy.uint8_t, ndim=3] image becomes unsigned char[:, :, ::1] image, given that image's data is C-contiguous. Alternatively, you could use uint8_t[:, :, ::1] after cimporting the C type uint8_t from libc.stdint.
The variable grey is a double, while the elements of image are np.uint8 (equivalent to unsigned char). So when doing image[i, j, k] = grey in Python, grey is cast to an unsigned char, i.e. the fractional part is truncated. In Cython, you have to do the cast manually.
After you know your code works as expected, you can further accelerate it with directives for the Cython compiler, e.g. disabling bounds checking and negative-index handling (wraparound). Note that these are decorators that need to be cimported from the cython module.
And your code snippet becomes:
from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def cython_color2gray(unsigned char[:, :, ::1] image):
    cdef int x, y, z
    cdef double z_val, grey
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            grey = 0
            for z in range(image.shape[2]):
                if z == 0:
                    z_val = image[x, y, 0] * 0.21
                    grey += z_val
                elif z == 1:
                    z_val = image[x, y, 1] * 0.07
                    grey += z_val
                elif z == 2:
                    z_val = image[x, y, 2] * 0.72
                    grey += z_val
            image[x, y, :] = <unsigned char> grey
    return image
Looking closely, you'll see that there's no need for the innermost loop:
from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def cython_color2gray(unsigned char[:, :, ::1] image):
    cdef int x, y
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            image[x, y, :] = <unsigned char>(image[x, y, 0] * 0.21 + image[x, y, 1] * 0.07 + image[x, y, 2] * 0.72)
    return image
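Since bounds checks are switched off and the cast truncates, it can help to keep a slow but obviously correct reference around to compare against. The following is only a minimal pure-Python sketch: the name reference_color2gray is hypothetical, the image is assumed to be a nested list shaped [rows][cols][3] with values in 0..255, and int() stands in for the <unsigned char> cast. Unlike the Cython function, it returns a new list instead of modifying its input.

```python
def reference_color2gray(image):
    """Return a new nested list with every pixel replaced by its grey value.

    Assumption: `image` is a nested list shaped [rows][cols][3] holding
    integers in 0..255, e.g. the result of calling .tolist() on the array.
    """
    weights = (0.21, 0.07, 0.72)  # same channel weights as the Cython code
    result = []
    for row in image:
        new_row = []
        for pixel in row:
            # Weighted sum of the three channels, accumulated as a float.
            grey = sum(channel * w for channel, w in zip(pixel, weights))
            # int() drops the fractional part, like the <unsigned char> cast.
            g = int(grey)
            new_row.append([g, g, g])
        result.append(new_row)
    return result
```

Assuming img is a C-contiguous uint8 array, one way to use it would be to compare reference_color2gray(img.tolist()) with numpy.asarray(cython_color2gray(img.copy())).tolist(). The two should agree, except possibly for pixels whose grey value lies within floating-point rounding error of a whole number.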
Going one step further, you can try to accelerate Cython's generated C code by enabling your C compiler's auto-vectorization (in the sense of SIMD). For gcc/clang you can use the flags -O3 and -march=native. For MSVC it's /O2 and /arch:AVX2 (assuming your machine supports AVX2). If you're working inside a Jupyter notebook, you can pass C compiler flags via the -c=YOURFLAG argument of the Cython magic, e.g.:
%%cython -a -f -c=-O3 -c=-march=native
# your cython code here..
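Outside a notebook, the same flags can be passed through the build script. Below is a rough sketch of a setup.py, not a definitive build configuration: the module name color2gray, the file name color2gray.pyx, and the helper names vectorization_flags and build are all hypothetical, and the compiler is simply guessed from the platform (MSVC on Windows, gcc/clang elsewhere).

```python
import sys


def vectorization_flags(platform=sys.platform):
    """Return C compiler flags that enable optimization and SIMD code generation.

    Assumption: MSVC is the compiler on Windows, gcc or clang everywhere else.
    """
    if platform == "win32":
        return ["/O2", "/arch:AVX2"]  # only valid if the CPU supports AVX2
    return ["-O3", "-march=native"]


def build():
    # Imported lazily so the flag helper above works without these packages.
    from setuptools import Extension, setup
    from Cython.Build import cythonize

    ext = Extension(
        "color2gray",        # hypothetical module name
        ["color2gray.pyx"],  # hypothetical file containing cython_color2gray
        extra_compile_args=vectorization_flags(),
    )
    # annotate=True additionally writes the HTML report with the yellow lines.
    setup(ext_modules=cythonize([ext], annotate=True))
```

With a call to build() added at the end of setup.py, running python setup.py build_ext --inplace should compile the extension with these flags. Keep in mind that a binary built with -march=native is tuned to the build machine and may not run on older CPUs.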