python performance loops cython typed-memory-views

why performance of cython loop has diminished in comparison with python one in terms of speed?

I am trying to improve my python code in terms of speed by using cython features. My python code consists of py_child and py_parent classes and py_backup function which is like this:

import random
from time import clock
import numpy as np
from libc.string cimport memcmp
## python code #################################################
class py_child:
    def __init__(self, move):
        self.move = move
        self.Q = 0
        self.N = 0

class py_parent:
    def __init__(self):
        self.children = []
    def add_children(self, moves):
        for move in moves:
            self.children.append(py_child(move))

def py_backup(parent, white_rave, black_rave):
    for point in white_rave:
        for ch in parent.children:
            if ch.move == point:
                ch.Q += 1
                ch.N += 1

    for point in black_rave:
        for ch in parent.children:
            if ch.move == point:
                ch.Q += 1
                ch.N += 1

and this is the same implementation in cython by using memoryviews as some variables:

## cython ######################################################

cdef class cy_child:
    cdef public:
        int[:] move
        int Q
        int N
    def __init__(self, move):
        self.move = move
        self.Q = 0
        self.N = 0

cdef class cy_parent:
    cdef public:
        list children
        int[:, :] moves
    def __init__(self):
        self.children = []
    def add_children(self, moves):
        cdef int i = 0
        cdef int N = len(moves)
        for i in range(N):
            self.children.append(cy_child(moves[i]))

cpdef cy_backup(cy_parent parent_node, int[:, :] white_rave,int[:, :] black_rave):
    cdef int[:] move
    cdef cy_child ch
    for move in white_rave:
        for ch in parent_node.children:
            if memcmp(&move[0], &ch.move[0], move.nbytes) == 0:
                ch.Q += 1
                ch.N += 1

    for move in black_rave:
        for ch in parent_node.children:
            if memcmp(&move[0], &ch.move[0], move.nbytes) == 0:
                ch.Q += 1
                ch.N += 1

Now I want to evaluate the speed of code for functions cy_backup, py_backup.So I use this code:

### Setup variables #########################################
size = 11
board = np.random.randint(2, size=(size, size), dtype=np.int32)

for x in range(board.shape[0]):
    for y in range(board.shape[1]):
        if board[x,y] == 0:
            black_rave.append((x,y))
        else:
            white_rave.append((x,y))

py_temp = []
for i in range(size):
    for j in range(size):
        py_temp.append((i,j))

#### python arguments #######################################

py = py_parent()
py.add_children(py_temp)
# also py_temp, black_rave, white_rave

#### cython arguments #######################################
cy_temp = np.assarray(py_temp, , dtype= np.int32)
cy_black_rave = np.asarray(black_rave, dtype= np.int32)
cy_white_rave = np.asarray(white_rave, dtype= np.int32)
cy = cy_parent()
cy.add_children(cy_temp)

#### Speed test #################################################
%timeit py_backup(py_parent, black_rave, white_rave)
%timeit cy_backup(cy_parent, cy_black_rave, cy_white_rave)

when I ran the program, I was surprised by the results:

1000 loops, best of 3: 759 µs per loop
100 loops, best of 3: 6.38 ms per loop

I was expecting cython to be much more faster than python specially when memoryviews are used.
Why the loop in cython runs slower than loop in python?
It would be highly appreciated if anyone has any suggestion to speed up the code in cython.
In advance I apologize for my question including too much code.

Solution

Cython memoryviews are really only optimised for one thing which is accessing single elements or slices (usually in a loop)

# e.g.
cdef int i
cdef int[:] mview = # something
for i in range(mview.shape[0]):
   mview[i] # do some work with this....

This type of code can be converted directly into efficient C code. For pretty much any other operation the memoryview is treated as a Python object.

Unfortunately almost none of your code takes advantage of the one thing memoryviews are good at, so you get no real speed up. Instead it's actually worse because you've added an extra layer, and a whole load of small length 2 memoryviews is going to be very bad.

My advice is really just to use lists - they're actually pretty good for this kind of thing and it isn't at all clear to me how to rewrite your code to really speed it up with Cython.

Some small optimizations I've spotted: You can get a pretty good idea of how optimised Cython is by looking at the highlighted html file generated by cython -a. You'll see that general iteration of a memoryview is slow (i.e. pure Python). You get an improvement by changing

# instead of:
# for move in white_rave:
for i in range(white_rave.shape[0]):
    move = white_rave[i,:]

This lets Cython iterate the memoryview in an efficient way.

You can get a bit more speed by turning off some of the safety checks for the memcmp line:

with cython.boundscheck(False), cython.initializedcheck(False):
   if memcmp(&move[0], &ch.move[0], move.nbytes) == 0:

(you need to cimport cython). If you do this and you haven't initialized ch.move or both memoryviews doesn't have at least one element then your program may crash.

I realise this isn't a helpful answer, but so long as you want to keep child as a Python class (event a cdef one) there really isn't much you can do to speed it up. You might consider changing it to a C struct (which you could have a C array of) but then you lose all the benefits of working with Python (i.e. you have to manage your own memory and you can't access it easily from Python code).