Tags: gpu, vectorization, numba, dispatch, numpy-ufunc

Combined vectorized functions in Numba


I'm using Numba (version 0.37.0) to optimize code for the GPU, and I would like to combine vectorized functions (using Numba's @vectorize decorator).

Imports & Data:

import numpy as np
from math import sqrt
from numba import vectorize, guvectorize

angles = np.random.uniform(-np.pi, np.pi, 10)
coords = np.stack([np.cos(angles), np.sin(angles)], axis=1)

This works as expected (the points lie on the unit circle, so every norm is 1):

@guvectorize(['(float32[:], float32[:])'], '(i)->()', target='cuda')
def l2_norm(vec, out):
    acc = 0.0
    for value in vec:
        acc += value**2
    out[0] = sqrt(acc)

l2_norm(coords)

Output:

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32)

But I'd like to avoid that "for" loop inside "l2_norm" by calling another vectorized function instead.

I've tried this:

@vectorize(["float32(float32)"], target="cuda")
def power(value):
    return value**2

@guvectorize(['(float32[:], float32[:])'], '(i)->()', target='cuda')
def l2_norm_power(vec, out):
    acc = 0.0
    acc = power(vec)
    acc = acc.sum()
    out[0] = sqrt(acc)

l2_norm_power(coords)

But this raises a TypingError:

TypingError: Failed at nopython (nopython frontend)
Untyped global name 'power': cannot determine Numba type of <class 'numba.cuda.dispatcher.CUDAUFuncDispatcher'>

Any idea how to perform this combination?

Any suggestions for other ways to optimize l2_norm with Numba?


Solution

  • I think you can only call functions compiled with device=True from other CUDA functions:

    3.13.2. Example: Calling Device Functions

    All CUDA ufunc kernels have the ability to call other CUDA device functions:

     from numba import vectorize, cuda
     # define a device function
     @cuda.jit('float32(float32, float32, float32)', device=True, inline=True)
     def cu_device_fn(x, y, z):
         return x ** y / z
     # define a ufunc that calls our device function
     @vectorize(['float32(float32, float32, float32)'], target='cuda')
     def cu_ufunc(x, y, z):
         return cu_device_fn(x, y, z)
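
    For reference, here is a minimal usage sketch for the quoted ufunc (the sample inputs are my own illustration, not from the docs):

    import numpy as np

    x = np.arange(1.0, 5.0, dtype=np.float32)
    y = np.full(4, 2.0, dtype=np.float32)
    z = np.full(4, 3.0, dtype=np.float32)

    cu_ufunc(x, y, z)  # element-wise x ** y / z, computed on the GPU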
    

    Note that you can call cuda.jit functions compiled with device=True:

    from numba import cuda

    @cuda.jit(device=True)
    def sum_of_squares(arr):
        # plain Python loop, compiled as a CUDA device function
        acc = 0.0
        for item in arr:
            acc += item ** 2
        return acc

    @guvectorize(['(float32[:], float32[:])'], '(i)->()', target='cuda')
    def l2_norm_power(vec, out):
        acc = sum_of_squares(vec)  # device-function call inside the kernel
        out[0] = sqrt(acc)

    l2_norm_power(coords)
    

    But that probably defeats the purpose of splitting it.

    Since numba.vectorize doesn't support device=True, that's not possible for these functions. But that's a good thing: vectorize allocates an array to put the values in, which is an unnecessary intermediate array, and allocating arrays on the GPU is also very inefficient (and disallowed in Numba):

    3.5.5. Numpy support

    Due to the CUDA programming model, dynamic memory allocation inside a kernel is inefficient and is often not needed. Numba disallows any memory allocating features. This disables a large number of NumPy APIs. For best performance, users should write code such that each thread is dealing with a single element at a time.
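
    If you really want to reuse the element-wise ufunc, one workaround (my suggestion, under the assumption that the intermediate array is acceptable) is to compose at the array level on the host, outside any kernel:

    squares = power(coords.astype(np.float32))  # element-wise squares on the GPU
    norms = np.sqrt(squares.sum(axis=1))        # reduction back on the host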


    Given all that I would simply use your original approach:

    @guvectorize(['(float32[:], float32[:])'], '(i)->()', target='cuda')
    def l2_norm(vec, out):
        acc = 0.0
        for value in vec:
            acc += value**2
        out[0] = sqrt(acc)
    
    l2_norm(coords)
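
    As a quick sanity check (my addition, computed host-side), the result can be compared with NumPy:

    np.testing.assert_allclose(l2_norm(coords), np.linalg.norm(coords, axis=1), rtol=1e-6)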