Why does Cython keep making Python objects instead of C?

I am trying to learn Cython, and I compile with annotate=True.

The basic manual says:
If a line is white, it means that the code generated doesn’t interact with Python, so will run as fast as normal C code. The darker the yellow, the more Python interaction there is in that line

Then I wrote this code, following (as far as I understood them) the instructions in the basic manual on using NumPy in Cython:

+14: cdef entropy(counts):
 15:     '''
 16:     INPUT: pandas table with counts as obsN
 17:     OUTPUT: general entropy
 18:     '''
+19:     cdef int l = counts.shape[0]
+20:     cdef np.ndarray probs = np.zeros(l, dtype=np.float)
+21:     cdef int totals = np.sum(counts)
+22:     probs = counts/totals
+23:     cdef np.ndarray plogp = np.zeros(l, dtype=np.float)
+24:     plogp = ( probs.T * (np.log(probs)) ).T
+25:     cdef float d = np.exp(-1 * np.sum(plogp))
+26:     cdef float relative_d = d / probs.shape[0]
 27: 
+28:     return {'d':d,
+29:             'relative_d':relative_d
 30:             }

All the lines with a "+" at the beginning are yellow in the cython.debug.output.html file.

What am I doing wrong? How can I make at least part of this function run at C speed? The function returns a Python dictionary, so I think that I can't return any C data type. I might be wrong here too.

Thank you for the help!

Solution

First of all, Cython does not rewrite NumPy functions; it just calls them as CPython does. This is the case for np.zeros, np.sum and np.log, for example, so such calls will not be faster with Cython. If you want faster code, you can reimplement them with plain loops in your own code. However, this may not be faster. On one hand, NumPy calls introduce overhead (type checking, which AFAIK is still performed under Cython, internal function calls, wrappers, etc.) that is significant for small arrays, and each function creates a temporary array that is slow to read and write when it is large. On the other hand, some NumPy functions use highly optimized code (such as BLAS or low-level SIMD intrinsics).

Moreover, division in Python does not behave the same way as in C. This is why Cython provides the cython.cdivision directive, which can be set to True (it is False by default). When Python division semantics are used, Cython generates slower wrapping code. Finally, np.ndarray is a CPython object type and behaves as such; you can use typed memoryviews to avoid dealing with NumPy objects.
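To illustrate the difference in division semantics, here is a small plain-Python sketch. The helpers c_style_div and c_style_mod are hypothetical names written for this example only; they emulate how C integer division truncates toward zero, whereas Python floors toward negative infinity. Bridging that gap (plus a zero-divisor check) is the kind of extra work Cython has to generate when cdivision is False.

```python
def c_style_div(a, b):
    """Integer quotient truncated toward zero, as C's / does for ints."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def c_style_mod(a, b):
    """Remainder carrying the sign of the dividend, as C's % does."""
    return a - b * c_style_div(a, b)


# Python floors the quotient; the remainder takes the sign of the divisor.
print(-7 // 2, -7 % 2)

# C truncates the quotient; the remainder takes the sign of the dividend.
print(c_style_div(-7, 2), c_style_mod(-7, 2))
```

The two conventions only differ when the operands have opposite signs, which is why enabling cdivision is usually safe for code that divides non-negative sizes or counts.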

If you want fast code, you will most likely need to use memoryviews and plain loops, avoid creating temporary arrays, and possibly use multiple threads. Additionally, you can use np.empty instead of np.zeros in your case, since the initial values are never read. Besides this, NumPy transposition is not very efficient, and NumPy does not solve this problem for you. You can implement a tiled transposition to speed it up, but doing so efficiently is not trivial. Here is a Numba implementation that can certainly be easily transformed to Cython code. Simply putting some cdef declarations on NumPy-based Python code generally does not make it faster.
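As a rough sketch of the loop-based structure described above, here is the entropy computation written in plain Python with no temporary arrays. The function name entropy_loops is made up for this example, and skipping zero probabilities is an assumption the original code does not make (it would produce nan there). To turn this into fast Cython, one would presumably type counts as a memoryview such as double[:], declare i, n, total, p and plogp_sum with cdef, and cimport log and exp from libc.math; that translation is not shown here and has not been benchmarked.

```python
import math


def entropy_loops(counts):
    """Return d = exp(-sum(p * log(p))) and d / n using plain loops only."""
    n = len(counts)

    # First pass: total of all counts.
    total = 0.0
    for i in range(n):
        total += counts[i]

    # Second pass: accumulate p * log(p) directly, without building
    # the intermediate probs and plogp arrays.
    plogp_sum = 0.0
    for i in range(n):
        p = counts[i] / total
        if p > 0.0:  # assumption: treat 0 * log(0) as 0
            plogp_sum += p * math.log(p)

    d = math.exp(-plogp_sum)
    return {'d': d, 'relative_d': d / n}


print(entropy_loops([1, 1, 1, 1]))
```

With typed variables, every line inside the two loops only touches C values, so those lines should show up white in the annotation; building the returned dictionary will still be yellow, which is fine because it runs once per call.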