python-3.x cython memset typed-memory-views

Using memset instead of np.zeros in a Cython script for speed gains

I started working on a SciPy interface to Fortran libraries (BLAS/LAPACK) as one can see here: Calling BLAS / LAPACK directly using the SciPy interface and Cython and came up with a solution but had to resort to using numpy.zeros which in effect killed any speed gains from calling the Fortran code directly. The issue was the Fortran code needs an 0 valued output matrix (it operates on the matrix in-place in memory) to match the Numpy version (np.outer).

So I was a bit perplexed since a 1000x1000 zeros matrix in Python only take 8us (using %timeit, or 0.008ms) so why did adding Cython code kill the runtime, noting I'm also creating it on a memoryview? (basically it goes from 3ms to 8ms or so on a 1000 x 1000 matrix multiplication). Then after searching around on SO I found a solution elsewhere using memset as the fastest array changing mechanism. So I rewrote the whole code from the referenced post to the newer memoryview format and got to something like this (Cython cyblas.pyx file):

import cython
import numpy as np
cimport numpy as np
from libc.string cimport memset #faster than np.zeros

REAL = np.float64
ctypedef np.float64_t REAL_t
ctypedef np.uint64_t  INT_t

cdef int ONE = 1
cdef REAL_t ONEF = <REAL_t>1.0

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef outer_prod(double[::1] _x, double[::1] _y, double[:, ::1] _output):

    cdef int M = _y.shape[0]
    cdef int N = _x.shape[0]
    memset(&_output[0,0], 0, M*N)

    with nogil:
        dger(&M, &N, &ONEF, &_y[0], &ONE, &_x[0], &ONE, &_output[0,0], &M)

TEST SCRIPT

import numpy as np;
from cyblas import outer_prod;
a=np.random.randint(0,100, 1000);
b=np.random.randint(0,100, 1000);
a=a.astype(np.float64)
b=b.astype(np.float64)
cy_outer=np.zeros((a.shape[0],b.shape[0]));
np_outer=np.zeros((a.shape[0],b.shape[0]));

%timeit outer_prod(a,b, cy_outer)
%timeit np.outer(a,b, np_outer)

So this fixed my issue of resetting the output matrix values BUT ONLY TO ROW 125, BEYOND THAT THIS ISSUE STILL EXISTS (resolved see below). I thought setting the memset length parameter to M*N would clear 1000*1000 in memory but apparently not.

Does anyone have an idea on how to reset the whole output matrix to 0 using memset? Much appreciated.

Solution

[UPDATE - fixes are: it needed the #bytes not just the array size for M*N, i.e. M*N*variable_bytes as the length input. The logic is seen here by the prior results: row 125 is where it stopped *8 bytes = 1000 rows, hence answers the question. Still metrics are not great: 100 loops, best of 3: 5.41 ms per loop (cython) 100 loops, best of 3: 3.95 ms per loop (numpy) but solved nonetheless. Changes to the above code are to add: cdef variable_bytes = np.dtype(REAL).itemsize #added to get bytes for memset, after REAL is defined, in this case 8 bytes Then as you call memset: memset(&_output[0,0], 0, M*N*variable_bytes) # gives the same output as np.zeros function The only place to speed this up now I can see is to use the prange OpenMP statements on large matrices, but yet to be seen.