I started working on a SciPy interface to Fortran libraries (BLAS/LAPACK), as one can see here: Calling BLAS / LAPACK directly using the SciPy interface and Cython, and came up with a solution, but I had to resort to using numpy.zeros, which in effect killed any speed gains from calling the Fortran code directly. The issue was that the Fortran code needs a zero-valued output matrix (it operates on the matrix in-place in memory) to match the NumPy version (np.outer).
So I was a bit perplexed, since creating a 1000x1000 zeros matrix in Python only takes 8us (using %timeit, i.e. 0.008ms), so why did adding the Cython code kill the runtime, noting that I'm also creating it on a memoryview? (Basically it goes from 3ms to 8ms or so on a 1000 x 1000 outer product.) Then, after searching around on SO, I found a solution elsewhere using memset as the fastest array-clearing mechanism. So I rewrote the whole code from the referenced post into the newer memoryview format and got to something like this (Cython cyblas.pyx file):
import cython
import numpy as np
cimport numpy as np
from libc.string cimport memset               # faster than np.zeros for clearing memory
from scipy.linalg.cython_blas cimport dger    # BLAS rank-1 update (outer product)

REAL = np.float64
ctypedef np.float64_t REAL_t
ctypedef np.uint64_t INT_t

cdef int ONE = 1
cdef REAL_t ONEF = <REAL_t>1.0

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef outer_prod(double[::1] _x, double[::1] _y, double[:, ::1] _output):
    cdef int M = _y.shape[0]
    cdef int N = _x.shape[0]
    memset(&_output[0, 0], 0, M*N)    # intended to zero the output matrix (this is the problematic line)
    with nogil:
        dger(&M, &N, &ONEF, &_y[0], &ONE, &_x[0], &ONE, &_output[0, 0], &M)
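For reference, a minimal setup.py along the lines below should build the module (just a sketch; the exact file names and flags are assumptions about the build setup, not part of the question):

# setup.py -- minimal build sketch for cyblas.pyx (names/flags are illustrative)
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "cyblas",
    sources=["cyblas.pyx"],
    include_dirs=[np.get_include()],   # NumPy headers needed for `cimport numpy`
)

setup(ext_modules=cythonize([ext]))

# build in place with: python setup.py build_ext --inplace

With the module built, the timing test below can be run from IPython: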
import numpy as np
from cyblas import outer_prod

a = np.random.randint(0, 100, 1000)
b = np.random.randint(0, 100, 1000)
a = a.astype(np.float64)
b = b.astype(np.float64)
cy_outer = np.zeros((a.shape[0], b.shape[0]))
np_outer = np.zeros((a.shape[0], b.shape[0]))

%timeit outer_prod(a, b, cy_outer)
%timeit np.outer(a, b, np_outer)
So this fixed my issue of resetting the output matrix values, BUT ONLY UP TO ROW 125; BEYOND THAT THE ISSUE STILL EXISTS (resolved, see the update below). I thought setting the memset length parameter to M*N would clear all 1000*1000 entries in memory, but apparently not.

Does anyone have an idea on how to reset the whole output matrix to 0 using memset? Much appreciated.
[UPDATE - the fix: memset needs the length in bytes, not just the element count M*N, i.e. M*N*variable_bytes as the length input. The logic is visible in the prior results: clearing only M*N = 1,000,000 bytes at 8 bytes per double covers 125,000 elements, i.e. exactly the 125 rows where it stopped, which answers the question. The metrics are still not great: 100 loops, best of 3: 5.41 ms per loop (Cython) vs. 100 loops, best of 3: 3.95 ms per loop (NumPy), but solved nonetheless.

The change to the above code is to add (after REAL is defined):

cdef variable_bytes = np.dtype(REAL).itemsize    # bytes per element for memset, in this case 8

and then to call memset with the byte count:

memset(&_output[0,0], 0, M*N*variable_bytes)     # now gives the same output as np.zeros
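As a quick sanity check after the fix (reusing the a, b, cy_outer, np_outer names from the timing snippet above), the two results should now agree:

outer_prod(a, b, cy_outer)
np.outer(a, b, np_outer)
print(np.allclose(cy_outer, np_outer))    # True once memset clears the full matrix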
The only place I can see to speed this up now is to use prange (OpenMP) statements on large matrices, but that remains to be seen.
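For example, a rough sketch of parallelizing just the zero-initialization with prange might look like the following (untested; it assumes the extension is compiled with OpenMP flags such as -fopenmp, and it may well not beat memset):

import cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef zero_matrix(double[:, ::1] _output):
    cdef int i, j
    cdef int M = _output.shape[0]
    cdef int N = _output.shape[1]
    # each row is cleared by a separate OpenMP thread
    for i in prange(M, nogil=True):
        for j in range(N):
            _output[i, j] = 0.0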