
PyOpenCL Error when accessing result of kernel operation (RuntimeError: Unable to compute length of object)


I'm new to OpenCL and am trying to write a relatively simple data-processing script. The script works like this:

Before the OpenCL side of the script:

  1. Load data
  2. Prepare some arrays for use in processing

This is all done before the comment: ## Step #1. Obtain an OpenCL platform.

OpenCL side:

  1. Acquire platform and device, and create context
  2. Create kernel
  3. Create queue
  4. Put the arrays to be used in processing in PyOpenCL buffers
  5. Run kernel
  6. Get kernel output

Code:


    # -*- coding: utf-8 -*-
    """
    Created on Sun Nov 17 16:14:50 2019

    @author: Mike
    based on example provided here:
        https://www.drdobbs.com/open-source/easy-opencl-with-python/240162614?pgno=2

    """

    import pyopencl as cl
    from pyopencl import cltypes
    from pyopencl import array
    from pyopencl.elementwise import ElementwiseKernel
    import numpy as np
    import time
    import matplotlib.pyplot as plt
    from reikna import fft

    if __name__ == "__main__":

        # Load some test data!
        rawdata = np.load('fig8_13_raw_5000nm_2p0.npy')
        posdata = np.zeros((2048,40))
        flat_rawdata = rawdata.flatten()
        #t0 = flat_rawdata[0:2048*40].astype(np.uint16)  # The first B-scan for use with the demo
        t0=flat_rawdata.astype(np.uint16)
        n = 40  # Total A-scans in input (not all are included with B)
        x = 40  # A-scans in output

        B1 = posdata[2,:].astype(bool)  # builtin bool; the np.bool alias was removed in newer NumPy
        B2 = posdata[3,:].astype(bool)
        b = np.logical_or(B1,B2)

        b = np.zeros(40).astype(bool)
        b[0:40] = 1

        # Load chirp matrix, containing wavelengths corresponding to spectrum bins for lambda->k interpolation
        lam = np.load('lam.npy')
        # Define apodization window
        window = np.hanning(2048).astype(np.float16)

        lam_min = np.amin(lam)
        lam_max = np.amax(lam)
        d_lam = lam_max-lam_min
        d_k = (1/lam_min - 1/lam_max)/2048
        k = np.array([1/((1/lam_max)+d_k*i) for i in range(2048)])

        nn0 = np.zeros(2048,dtype=np.uint16)
        nn1 = np.zeros(2048,dtype=np.uint16)

        for i in range(2048):
            res = np.abs(lam-k[i])
            minind = np.argmin(res)
            if res[minind]>=0:
                nn0[i]=minind-1
                nn1[i]=minind
            else:
                nn0[i]=minind
                nn1[i]=minind+1

        ## Step #1. Obtain an OpenCL platform.
        platform = cl.get_platforms()
        platform = platform[0]

        print('\nAcquired platform: \n\t'+platform.name)

        ## Step #2. Obtain a device id for at least one device (accelerator).
        device = platform.get_devices()
        device = device[0]

        print('Acquired device: \n\t'+device.name)
        extensions = ['\t'+x for x in device.extensions.split()]
        print('Device extensions: ')
        for i in range(len(extensions)):
            print(extensions[i])

        ## Step #3. Create a context for the selected device.
        context = cl.Context([device])
        print('Created context.')     

        ## Step #6. Create one or more kernels from the program functions.
        program = cl.Program(context, """
            __kernel void interp_hann(__global const double *lambda_spec,
            __global const double *win, __global const double *k,__global int *nn0,
            __global int *nn1,__global const double *lam, __global double *result, double d_lam)
            {
                int gid = get_global_id(0);
                int gid1 = gid % 2048;
                double y1 = lambda_spec[nn0[gid1]];  // y-values from neighbors in spectrum
                double y2 = lambda_spec[nn1[gid1]];
                double x1 = lam[nn0[gid1]];  // corresponding initial wavelength
                double x = k[gid1];  // linear-in-wavenumber interpolation point

                if (y1 == y2)
                {
                    result[gid] = y1*win[gid1];
                }
                else
                {
                    result[gid] = (y1 + (x - x1) / (y2 - y1) * d_lam) * win[gid1];
                }

            }
            """).build()

        ## Step #7. Create a command queue for the target device.
        queue = cl.CommandQueue(context)

        ## Step #8. Allocate device memory and move input data from the host to the device memory.   
        result = np.zeros(t0.shape,dtype=cl.cltypes.float)
        mem_flags = cl.mem_flags
        n0_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=nn0)
        n1_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=nn1)
        win_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=window)
        k_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=k)
        raw_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=t0)
        lam_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=lam)
        dest_buf = cl.Buffer(context, mem_flags.WRITE_ONLY, result.nbytes)
        ## Step #9. Associate the arguments to the kernel with kernel object.

        ## Step #10. Deploy the kernel for device execution.

        # check that array sizes are all correct
        print(nn0.shape,nn1.shape,window.shape,k.shape,t0.shape,lam.shape,result.shape)

        # run kernel and get event to wait for
        evt=program.interp_hann(queue, (len(result),), None, raw_buf, win_buf, k_buf,
                            n0_buf, n1_buf, lam_buf, dest_buf, d_lam)

        ## Step #11. Move the kernel’s output data to host memory.

        cl.enqueue_copy(queue,result,dest_buf,wait_for=evt)

        print(result)
        ## Step #12. Release context, program, kernels and memory.
        ## PyOpenCL performs this step for you, and therefore,
        ## you don't need to worry about cleanup code

Problem: Running the script, I get the following output and error:

    Acquired platform: 
            NVIDIA CUDA
    Acquired device: 
            GeForce RTX 2080
    Device extensions: 
            cl_khr_global_int32_base_atomics
            cl_khr_global_int32_extended_atomics
            cl_khr_local_int32_base_atomics
            cl_khr_local_int32_extended_atomics
            cl_khr_fp64
            cl_khr_byte_addressable_store
            cl_khr_icd
            cl_khr_gl_sharing
            cl_nv_compiler_options
            cl_nv_device_attribute_query
            cl_nv_pragma_unroll
            cl_nv_d3d10_sharing
            cl_khr_d3d10_sharing
            cl_nv_d3d11_sharing
            cl_nv_copy_opts
            cl_nv_create_buffer
            cl_khr_int64_base_atomics
            cl_khr_int64_extended_atomics
    Created context.
    (2048,) (2048,) (2048,) (2048,) (40960000,) (2048,) (40960000,)
    Traceback (most recent call last):

      File "C:\Users\Mike\Desktop\Drive\PC Workspace\Senior Design\opencl\test.py", line 134, in <module>
        cl.enqueue_copy(queue,result1,dest_buf,wait_for=evt)

      File "C:\Users\Mike\Anaconda3\lib\site-packages\pyopencl\__init__.py", line 1719, in enqueue_copy
        return _cl._enqueue_read_buffer(queue, src, dest, **kwargs)

    RuntimeError: Unable to compute length of object

I can't really figure out what this error means, and am unsure how to approach debugging.

The numbers before the traceback are the shapes of the arrays used in the kernel. What I am trying to do is go through all 40960000 elements of t0/raw_buf with gid as the iterator, while gid1 = gid % 2048 maintains the 0-2047 indexing for use with the arrays of length 2048. I suspect something is going wrong in this part of the kernel, since I am really unclear on how the get_global_id() function works.
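
The indexing described above can be illustrated with a host-side emulation. This is only a sketch in plain Python, not PyOpenCL API: `emulate_launch` and `index_pairs` are hypothetical names, and the loop stands in for the device, where each work-item runs the kernel body once and `get_global_id(0)` returns that work-item's index in the range 0 to global_size - 1.

```python
# Host-side emulation of a 1-D kernel launch (illustrative only, not PyOpenCL API).
SPECTRUM_LEN = 2048  # samples per A-scan, as in the question


def emulate_launch(global_size, kernel_body):
    """Call kernel_body once per work-item with the value get_global_id(0) would return."""
    for gid in range(global_size):  # on a real device these calls run in parallel, in no fixed order
        kernel_body(gid)


def index_pairs(global_size):
    """Collect (gid, gid1) for every work-item, where gid1 = gid % SPECTRUM_LEN."""
    pairs = []
    emulate_launch(global_size, lambda gid: pairs.append((gid, gid % SPECTRUM_LEN)))
    return pairs


pairs = index_pairs(3 * SPECTRUM_LEN)  # three A-scans' worth of work-items
print(pairs[0], pairs[2047], pairs[2048], pairs[-1])
# (0, 0) (2047, 2047) (2048, 0) (6143, 2047)
```

So `gid` does walk over every output element and `gid1` wraps every 2048 items, as intended. One thing worth checking in the kernel as posted: `lambda_spec[nn0[gid1]]` is indexed only by values in the 0-2047 range, so it always reads from the first 2048 samples; if each A-scan is meant to be interpolated from its own data, the read index would presumably need the A-scan offset `gid - gid1` added.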

I am also unclear on how to determine the 2nd and 3rd arguments to the kernel (under comment: # run kernel and get event to wait for). I know that they correspond to global_size and local_size, but am unsure of how I should determine these arguments for my application. Any advice (or recommended reading) would be greatly appreciated.
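
As a rough guide to those two arguments: the global size is the total number of work-items to launch (here, one per output element), and the local size is the work-group size. Passing `None` as the local size, as the script does, lets the OpenCL implementation choose one. If a local size is given explicitly, the global size generally has to be a multiple of it, so it is common to round up and have the kernel skip the extra work-items. The sketch below shows that rounding; `padded_global_size` is a hypothetical helper and the work-group size of 256 is an assumed value, not something taken from the question.

```python
def padded_global_size(n_items, local_size):
    """Return the smallest multiple of local_size that is >= n_items."""
    groups = (n_items + local_size - 1) // local_size  # ceiling division
    return groups * local_size


local_size = 256  # assumed; must not exceed device.max_work_group_size
print(padded_global_size(40960000, local_size))  # 40960000 (already a multiple of 256)
print(padded_global_size(1000, local_size))      # 1024

# Hypothetical launch with an explicit local size:
#   global_size = padded_global_size(len(result), local_size)
#   program.interp_hann(queue, (global_size,), (local_size,), ...)
# When global_size is larger than the data, the kernel needs a guard such as
#   if (gid >= n_items) return;
# with n_items passed in as an extra argument.
```

For a one-dimensional problem like this one, `(len(result),)` with a local size of `None` is a reasonable starting point; an explicit local size mainly matters once the kernel uses local memory or needs tuning.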


Solution

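  • I can't run your example because I don't have fig8_13_raw_5000nm_2p0.npy, but I have hit the same error when making the copy wait on the event returned by the kernel call, as in wait_for=evt.

    Try dropping the wait_for argument. Both commands go into the same command queue, which is in-order by default, so the copy will not start until the kernel has finished: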

        program.interp_hann(queue, (len(result),), None, raw_buf, win_buf, k_buf,
                            n0_buf, n1_buf, lam_buf, dest_buf, d_lam)
        cl.enqueue_copy(queue,result,dest_buf)
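
A likely explanation for the error message is that PyOpenCL documents `wait_for` as either `None` or a list of `pyopencl.Event` instances, so a bare event has no length to compute. If an explicit dependency is wanted, wrapping the event in a list should work. The sketch below uses a hypothetical helper, `as_wait_list`, and a stand-in `FakeEvent` class so that it runs without an OpenCL device; neither is part of PyOpenCL.

```python
def as_wait_list(events):
    """Normalize a wait_for value: None stays None, a single event becomes [event]."""
    if events is None:
        return None
    try:
        return list(events)   # already a list or tuple of events
    except TypeError:         # a single, non-iterable event object
        return [events]


class FakeEvent:
    """Stand-in for pyopencl.Event, used only to make this sketch runnable."""


evt = FakeEvent()
print(as_wait_list(evt) == [evt])         # True
print(as_wait_list([evt, evt]) == [evt, evt])  # True
print(as_wait_list(None))                 # None

# With PyOpenCL, the call from the question would then read:
#   evt = program.interp_hann(queue, (len(result),), None, ...)
#   cl.enqueue_copy(queue, result, dest_buf, wait_for=[evt])
```

Either form, dropping `wait_for` or passing `[evt]`, should avoid the `RuntimeError`; the list form is the one to reach for when the kernel and the copy are enqueued on different queues or on an out-of-order queue.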