CUDA Python Error: TypingError: cannot determine Numba type of <class 'object'>

Background: I'm trying to create a simple bootstrap function for sampling means with replacement. I want to parallelize the function because I will eventually deploy it on data with millions of points and will want much larger sample sizes. I've run other examples, such as the Mandelbrot example, successfully. The code below also includes a CPU version, which runs fine.

I've read several resources to get this up and running:

Random Numbers with CUDA

Writing Kernels in CUDA

The issue: This is my first foray into CUDA programming, and I believe I have everything set up correctly. I'm getting one error that I cannot figure out:

TypingError: cannot determine Numba type of <class 'object'>

I believe the line in question is:

bootstrap_rand_gpu[threads_per_block, blocks_per_grid](rng_states, dt_arry_device, n_samp, out_mean_gpu)

Attempts to resolve the issue: I won't go into full detail, but here is what I tried:

I thought it might have something to do with cuda.to_device(), so I varied how I used it and also tried cuda.device_array_like(). I've used to_device() for all parameters and for just a few. I've seen code samples that use it for every parameter and others that don't, so I'm not sure which is correct.
I removed the GPU random number generator (create_xoroshiro128p_states) and used a static value to test.
I tried explicitly casting integers with int() (and not). I read that Numba supports only a limited set of data types, so I made sure the values were ints.

Numba Supported Datatypes

A few other things I don't recall...

Apologies for messy code. I'm a bit at wits' end on this.

Below is the full code:

import numpy as np
from numpy import random
from numpy.random import randn
import pandas as pd
from timeit import default_timer as timer

from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
from numba import *

def bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean):
    for i in range(boot_samp):
        rand_idx = random.randint(n_samp-1,size=(50)) #get random array of indices 0-49, with replacement
        out_mean[i] = dt_arry[rand_idx].mean()
     
@cuda.jit
def bootstrap_rand_gpu(rng_states, dt_arry, n_samp, out_mean):
    thread_id = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(thread_id, dt_arry.shape[0], stride):
        for k in range(0,n_samp-1,1):
            rand_idx_arry[k] = int(xoroshiro128p_uniform_float32(rng_states, thread_id) * 49)         
        out_mean[thread_id] = dt_arry[rand_idx_arry].mean()



mean = 10
rand_fluc = 3
n_samp = int(50)
boot_samp = int(1000)

dt_arry = (random.rand(n_samp)-.5)*rand_fluc + mean

out_mean_cpu = np.empty(boot_samp)
out_mean_gpu = np.empty(boot_samp)

##################
# RUN ON CPU
##################

start = timer()
bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean_cpu)
dt = timer() - start
print("CPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_cpu.mean()))
print("Bootstrap CPU in %f s" % dt)

##################
# RUN ON GPU
##################

threads_per_block = 64
blocks_per_grid = 24

#create random state for each state in the array
rng_states = create_xoroshiro128p_states(threads_per_block * blocks_per_grid, seed=1) 

start = timer()
dt_arry_device = cuda.to_device(dt_arry)
out_mean_gpu_device = cuda.to_device(out_mean_gpu)
bootstrap_rand_gpu[threads_per_block, blocks_per_grid](rng_states, dt_arry_device, n_samp, out_mean_gpu_device)
out_mean_gpu_device.copy_to_host()
dt = timer() - start

print("GPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_gpu.mean()))
print("Bootstrap GPU in %f s" % dt)

Solution

You seem to have at least four issues:

In your kernel code, rand_idx_arry is never defined.
Array methods such as .mean() are not available in CUDA device code, so the mean has to be accumulated in a loop.
Your kernel launch configuration parameters are reversed. Numba expects kernel[blocks_per_grid, threads_per_block].
Your kernel uses the wrong range for the grid-stride loop. dt_arry.shape[0] is 50, so you were only populating the first 50 locations of your GPU output array. As in your host code, the range for this loop should be the size of the output array (which is boot_samp).
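
The loop-range and .mean() points can be checked without a GPU. The sketch below is a pure-Python stand-in for a 1-D grid-stride kernel, not Numba code: simulate_kernel and its arguments are hypothetical names for illustration, and Python's random module replaces the xoroshiro128p states. It counts how often each output slot is written, and shows how many slots are reached when the loop is bounded by the data size instead of the output size.

```python
import random

def simulate_kernel(data, n_draws, out, blocks_per_grid, threads_per_block, seed=1):
    """Pure-Python stand-in for a 1-D grid-stride kernel (hypothetical helper)."""
    stride = blocks_per_grid * threads_per_block      # what cuda.gridsize(1) would return
    rng = random.Random(seed)                         # stand-in for the GPU RNG states
    writes = [0] * len(out)                           # number of writes per output slot
    for thread_id in range(stride):                   # what cuda.grid(1) would return
        # grid-stride loop: the bound is the OUTPUT size, not the data size
        for i in range(thread_id, len(out), stride):
            total = 0.0                               # accumulate by hand instead of .mean()
            for _ in range(n_draws):
                total += data[int(rng.random() * len(data))]
            out[i] = total / n_draws                  # index with i, not thread_id
            writes[i] += 1
    return writes

data = [10.0 + 0.1 * j for j in range(50)]            # made-up sample data

# Same launch shape as the post: 24 blocks of 64 threads, 1000 outputs.
out = [None] * 1000
writes = simulate_kernel(data, 50, out, blocks_per_grid=24, threads_per_block=64)

# Far fewer threads than outputs: indexing with i still fills every slot.
out_few = [None] * 1000
writes_few = simulate_kernel(data, 50, out_few, blocks_per_grid=2, threads_per_block=64)

# Slots reached if the loop were bounded by len(data) instead of len(out).
stride = 24 * 64
covered_wrong = {i for t in range(stride) for i in range(t, len(data), stride)}

print(all(w == 1 for w in writes), all(w == 1 for w in writes_few), len(covered_wrong))
# True True 50
```

Writing to out[i] rather than out[thread_id] is what keeps the result correct when the grid has fewer threads than outputs, because each thread then handles several values of i.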

There may be other issues as well, but when I refactor your code like this to address those issues, it seems to run without error:

$ cat t65.py
#import matplotlib.pyplot as plt
import numpy as np
from numpy import random
from numpy.random import randn
from timeit import default_timer as timer

from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
from numba import *

def bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean):
    for i in range(boot_samp):
        rand_idx = random.randint(n_samp-1,size=(50)) #random indices 0-48 with replacement (randint's upper bound is exclusive)
        out_mean[i] = dt_arry[rand_idx].mean()

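@cuda.jit
def bootstrap_rand_gpu(rng_states, dt_arry, n_samp, out_mean):
    thread_id = cuda.grid(1)    # global index of this thread
    stride = cuda.gridsize(1)   # total number of threads in the grid
    # grid-stride loop over the output array, one bootstrap mean per iteration
    for i in range(thread_id, out_mean.shape[0], stride):
        my_sum = 0.0
        # accumulate by hand, since .mean() is not available in device code
        for k in range(0,n_samp-1,1):
            my_sum += dt_arry[int(xoroshiro128p_uniform_float32(rng_states, thread_id) * 49)]
        out_mean[i] = my_sum/(n_samp-1)   # index by i so every slot is written even with fewer threads than outputs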



mean = 10
rand_fluc = 3
n_samp = int(50)
boot_samp = int(1000)

dt_arry = (random.rand(n_samp)-.5)*rand_fluc + mean

#plt.plot(dt_arry)

#figureData = plt.figure(1)
#plt.title('Plot ' + str(n_samp) + ' samples')
#plt.plot(dt_arry)
#figureData.show()

out_mean_cpu = np.empty(boot_samp)
out_mean_gpu = np.empty(boot_samp)

##################
# RUN ON CPU
##################

start = timer()
bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean_cpu)
dt = timer() - start
print("CPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_cpu.mean()))
print("Bootstrap CPU in %f s" % dt)


#figureMeanCpu = plt.figure(2)
#plt.title('Plot '+ str(boot_samp) + ' bootstrap means - CPU')
#plt.plot(out_mean_cpu)
#figureData.show()


##################
# RUN ON GPU
##################

threads_per_block = 64
blocks_per_grid = 24

#create random state for each state in the array
rng_states = create_xoroshiro128p_states(threads_per_block * blocks_per_grid, seed=1)

start = timer()
dt_arry_device = cuda.to_device(dt_arry)
out_mean_gpu_device = cuda.to_device(out_mean_gpu)
bootstrap_rand_gpu[blocks_per_grid, threads_per_block](rng_states, dt_arry_device, n_samp, out_mean_gpu_device)
out_mean_gpu = out_mean_gpu_device.copy_to_host()
dt = timer() - start

print("GPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_gpu.mean()))
print("Bootstrap GPU in %f s" % dt)
$ python t65.py
CPU Bootstrap mean of 1000 mean samples: 10.148048544038735
Bootstrap CPU in 0.037496 s
GPU Bootstrap mean of 1000 mean samples: 10.145088765532936
Bootstrap GPU in 0.416822 s
$
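
One detail in the listing is worth checking: which indices the two sampling expressions can actually produce. The sketch below evaluates both expressions from the post on the CPU. It assumes, as the Numba documentation states, that xoroshiro128p_uniform_float32 returns values in [0.0, 1.0); largest_u is a name introduced here for the largest float32 value below 1.0.

```python
import numpy as np

n_samp = 50

# CPU expression from the post: a single argument to randint is an
# exclusive upper bound, so indices come from 0 .. n_samp-2.
cpu_idx = np.random.randint(n_samp - 1, size=100_000)

# GPU expression from the post, evaluated at the extremes of a uniform
# draw in [0, 1): int() truncates toward zero.
largest_u = 1.0 - 2.0 ** -24      # largest float32 value below 1.0
gpu_lo = int(0.0 * 49)
gpu_hi = int(largest_u * 49)

print(cpu_idx.min(), cpu_idx.max())
print(gpu_lo, gpu_hi)             # 0 48
```

Both versions therefore draw indices from 0 to 48, so dt_arry[49] is never sampled by either. If the intent is to sample all 50 elements, that would mean random.randint(n_samp, ...) on the CPU and multiplying by n_samp on the GPU, which is an assumption about the intent rather than something stated in the post.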

Notes:

I've commented out code (the matplotlib plotting) that doesn't seem relevant. In future posts, consider removing anything that isn't relevant to your question.
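I've also fixed the final GPU printout. copy_to_host() called without arguments returns a new host array, so its result has to be assigned to out_mean_gpu; otherwise the print reads the uninitialized np.empty buffer.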
I haven't studied your code carefully. I'm not suggesting anything is defect free. I'm just pointing out some issues and providing a guide for how they might be addressed. I can see the results don't match between CPU and GPU, but since I don't know what you're doing, and also because the random number generators don't match between CPU and GPU code, it's not obvious to me that things should match.
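
To get a feel for how much two correct runs can differ purely because of the random stream, here is a minimal CPU-only sketch. It swaps in NumPy's default_rng for both of the post's generators, samples from all elements of the data, and uses a hypothetical helper name, bootstrap_mean_of_means; only the data recipe is taken from the post.

```python
import numpy as np

def bootstrap_mean_of_means(data, n_draws, boot_samp, seed):
    """Mean of boot_samp bootstrap means, each from n_draws draws with replacement."""
    rng = np.random.default_rng(seed)
    # one row of random indices per bootstrap sample (all elements eligible)
    idx = rng.integers(0, data.size, size=(boot_samp, n_draws))
    return data[idx].mean(axis=1).mean()

# Same recipe as dt_arry in the post: uniform noise of width 3 around 10.
data_rng = np.random.default_rng(0)
data = (data_rng.random(50) - 0.5) * 3 + 10

a = bootstrap_mean_of_means(data, 50, 1000, seed=1)
b = bootstrap_mean_of_means(data, 50, 1000, seed=2)
print(a, b, abs(a - b))
```

With these sizes the two results should agree only to within a few thousandths, which is the same order as the CPU/GPU difference in the output above. Agreement at that level is about as much as can be expected when the generators differ.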