Tags: python, numpy, anaconda, numba, numba-pro

Anaconda's NumbaPro CUDA Assertion Error


I am trying to use NumbaPro's CUDA extension to multiply large matrices. What I want in the end is to multiply an NxN matrix by a diagonal matrix supplied as a 1D array (i.e., a.dot(numpy.diagflat(b)), which I have found to be equivalent to a * b). However, I am getting an assertion error that provides no information.

I can only avoid this assertion error if I multiply two 1D arrays, but that is not what I want to do.
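
For reference, a quick NumPy check of the equivalence claimed above (CPU only, no NumbaPro needed):

import numpy as np

A = np.random.sample((4, 4))
B = np.random.sample(4) + 10

# A.dot(np.diagflat(B)) scales column j of A by B[j]; broadcasting
# A * B does the same without ever building the diagonal matrix.
assert np.allclose(A.dot(np.diagflat(B)), A * B)

Here is my full code: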

from numbapro import vectorize, cuda
from numba import f4,f8
import numpy as np

def generate_input(n):
    import numpy as np
    A = np.array(np.random.sample((n,n)))
    B = np.array(np.random.sample(n) + 10)
    return A, B

def product(a, b):
    return a * b

def main():
    cu_product = vectorize([f4(f4, f4), f8(f8, f8)], target='gpu')(product)

    N = 1000

    A, B = generate_input(N)
    D = np.empty(A.shape)

    stream = cuda.stream()

    with stream.auto_synchronize():
        dA = cuda.to_device(A, stream)
        dB = cuda.to_device(B, stream)
        dD = cuda.to_device(D, stream, copy=False)
        cu_product(dA, dB, out=dD, stream=stream)
        dD.to_host(stream)

if __name__ == '__main__':
    main()

This is what my terminal spits out:

Traceback (most recent call last):
  File "cuda_vectorize.py", line 32, in <module>
    main()
  File "cuda_vectorize.py", line 28, in main
    cu_product(dA, dB, out=dD, stream=stream)
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 109, in __call__
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 191, in _arguments_requirement
AssertionError

Solution

  • The problem is that you are using vectorize on a function that takes non-scalar arguments. The idea behind NumbaPro's vectorize is that it takes a scalar function as input and generates a function that applies the scalar operation in parallel to all the elements of a vector. See the NumbaPro documentation.
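
    To make the contrast concrete, here is the pattern vectorize is designed for: a minimal sketch using the same API as the question's code, in which the decorated scalar function becomes a ufunc that is applied elementwise to arrays of matching shape.

    from numbapro import vectorize
    from numba import f4, f8
    import numpy as np

    def product(a, b):
        return a * b

    # vectorize compiles the scalar function and applies it elementwise,
    # in parallel on the GPU, across same-shaped inputs.
    cu_product = vectorize([f4(f4, f4), f8(f8, f8)], target='gpu')(product)

    x = np.random.sample(1000).astype(np.float32)
    y = np.random.sample(1000).astype(np.float32)
    z = cu_product(x, y)  # elementwise product; this is why 1D-times-1D worked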

    Your function takes a matrix and a vector, which are definitely not scalar. [Edit] You can do what you want on the GPU using either NumbaPro's wrapper for cuBLAS, or by writing your own simple kernel function. Here's an example that demonstrates both. Note that you will need NumbaPro 0.12.2 or later (just released as of this edit).

    from numbapro import cuda
    from numba import float32
    import numbapro.cudalib.cublas as cublas
    import numpy as np
    from timeit import default_timer as timer
    
    def generate_input(n):
        A = np.array(np.random.sample((n,n)), dtype=np.float32)
        B = np.array(np.random.sample(n), dtype=A.dtype)
        return A, B
    
    @cuda.jit(argtypes=[float32[:,:], float32[:,:], float32[:]])
    def diagproduct(c, a, b):
        # Grid-stride loop: each thread handles multiple elements, so any
        # launch configuration covers the whole matrix.
        startX, startY = cuda.grid(2)
        gridX = cuda.gridDim.x * cuda.blockDim.x
        gridY = cuda.gridDim.y * cuda.blockDim.y
        height, width = c.shape

        for y in range(startY, height, gridY):
            for x in range(startX, width, gridX):
                # c = a * diag(b): column x of a is scaled by b[x]
                c[y, x] = a[y, x] * b[x]
    
    def main():
    
        N = 1000
    
        A, B = generate_input(N)
        D = np.empty(A.shape, dtype=A.dtype)
        E = np.zeros(A.shape, dtype=A.dtype)
        F = np.empty(A.shape, dtype=A.dtype)
    
        start = timer()
        E = np.dot(A, np.diag(B))
        numpy_time = timer() - start
    
        blas = cublas.api.Blas()

        start = timer()
        # Note: np.diag(B) expands the diagonal into a full NxN matrix on the
        # host before SGEMM runs a full O(n^3) matrix-matrix multiply.
        blas.gemm('N', 'N', N, N, N, 1.0, np.diag(B), A, 0.0, D)
        cublas_time = timer() - start
    
        diff = np.abs(D-E)
        print("Maximum CUBLAS error %f" % np.max(diff))
    
        blockdim = (32, 8)
        griddim  = (16, 16)
    
        start = timer()
        dA = cuda.to_device(A)
        dB = cuda.to_device(B)
        dF = cuda.to_device(F, copy=False)
        diagproduct[griddim, blockdim](dF, dA, dB)
        dF.to_host()
        cuda_time = timer() - start   
    
        diff = np.abs(F-E)
        print("Maximum CUDA error %f" % np.max(diff))
    
        print("Numpy took    %f seconds" % numpy_time)
        print("CUBLAS took   %f seconds, %0.2fx speedup" % (cublas_time, numpy_time / cublas_time)) 
        print("CUDA JIT took %f seconds, %0.2fx speedup" % (cuda_time, numpy_time / cuda_time))
    
    if __name__ == '__main__':
        main()
    

    The custom kernel is significantly faster because SGEMM performs a full matrix-matrix multiply (O(n^3)) and requires expanding the diagonal into a full matrix. The diagproduct kernel is smarter: it does a single multiply for each matrix element and never expands the diagonal to a full matrix (see the broadcasting aside after these results). Here are the results on my NVIDIA Tesla K20c GPU for N=1000:

    Maximum CUBLAS error 0.000000
    Maximum CUDA error 0.000000
    Numpy took    0.024535 seconds
    CUBLAS took   0.010345 seconds, 2.37x speedup
    CUDA JIT took 0.004857 seconds, 5.05x speedup
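
    As an aside, the same single-multiply-per-element computation that diagproduct performs can be expressed on the CPU with NumPy broadcasting; a minimal check, assuming A, B, and E as computed in main() above:

    # Broadcasting scales column x of A by B[x], exactly what diagproduct
    # does, without materializing the NxN diagonal matrix.
    E2 = A * B[np.newaxis, :]
    assert np.allclose(E2, E)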
    

    The timings include all of the copies to and from the GPU, which are a significant bottleneck for small matrices (a sketch for timing the kernel alone follows the next set of results). If we set N to 10,000 and run again, we get a much bigger speedup:

    Maximum CUBLAS error 0.000000
    Maximum CUDA error 0.000000
    Numpy took    7.245677 seconds
    CUBLAS took   1.371524 seconds, 5.28x speedup
    CUDA JIT took 0.264598 seconds, 27.38x speedup
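
    As noted above, copies dominate at small sizes. To time the kernel alone, the transfers can be hoisted out of the timed region; a rough sketch (it assumes cuda.synchronize() is available to wait for the asynchronous kernel launch, as in later Numba releases):

    # Copy inputs once, outside the timed region.
    dA = cuda.to_device(A)
    dB = cuda.to_device(B)
    dF = cuda.to_device(F, copy=False)

    start = timer()
    diagproduct[griddim, blockdim](dF, dA, dB)
    cuda.synchronize()  # kernel launches are asynchronous; wait before stopping the clock
    kernel_time = timer() - start

    dF.to_host()
    print("Kernel alone took %f seconds" % kernel_time)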
    

    For very small matrices, however, CUBLAS SGEMM has an optimized path, so its performance is closer to that of the custom CUDA kernel. Here, N=100:

    Maximum CUBLAS error 0.000000
    Maximum CUDA error 0.000000
    Numpy took    0.006876 seconds
    CUBLAS took   0.001425 seconds, 4.83x speedup
    CUDA JIT took 0.001313 seconds, 5.24x speedup