
Equivalent of 1D numpy.interp in C++ Cuda (Lerp in CUDA)


I have two vectors, 'xp' and 'yp', which hold the x and y values of the data, and a third vector 'x' containing the x coordinates at which I would like to evaluate the interpolated values. My results in Python using NumPy's interp function were as expected.

import numpy as np 

xp = np.array([1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 7.0, 8.0, 9.0, 10.0, 14.0, 17.0, 18.0, 20.0])
yp = xp**2
x = np.array([3,5])
np.interp(x,xp,yp) #array([  9.25,  26.  ])

My question is: how do I replicate this algorithm inside a CUDA kernel?

Here is my attempt:

Here, size = len(xp) == len(yp), and x_size = len(x).

__global__ void lerp_kernel(double* xp, double* yp, double* output_result,
                            double* x, unsigned int size, unsigned int x_size)
{
  for (unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x; idx < size;
       idx += blockDim.x*gridDim.x)
  {
    if (idx > size - 2) {
      idx = size - 2;
    }
    double dx = xp[idx + 1] - xp[idx];
    double dy = yp[idx + 1] - yp[idx];
    double dydx = dy/dx;
    double lerp_result = yp[idx] + (dydx * (x[idx] - xp[idx]));
    output_result[idx] = lerp_result;
  }
}

I think one of the mistakes I am making is that I am not searching for the index range in xp that contains each x (using something like numpy.searchsorted in Python). I am not sure how to implement this part in CUDA. If someone knows a better way to do lerp in CUDA, please let me know.

I have seen lerp functions in the documentation of Cg (https://developer.download.nvidia.com/cg/lerp.html), but these need a weight (a fraction between 0 and 1) for each x value. I am not sure how to rescale x so that I can use a weight to solve this problem.
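For context, a sketch of what I understand the Cg-style lerp to mean, with the rescaling step I'm unsure how to apply per segment (the endpoint values here are just taken from my xp/yp data above):

    #include <cassert>
    #include <cmath>

    // Cg-style lerp: blends a and b by a weight t in [0, 1]
    double lerp(double a, double b, double t) { return a + t * (b - a); }

    int main() {
        // Rescaling an x query into a weight for one segment [xl, xr]:
        //   t = (x - xl) / (xr - xl)
        double xl = 2.5, xr = 3.5;        // segment endpoints in xp
        double yl = 6.25, yr = 12.25;     // corresponding yp values
        double x = 3.0;
        double t = (x - xl) / (xr - xl);  // 0.5
        double y = lerp(yl, yr, t);       // 9.25, matching np.interp
        assert(std::fabs(y - 9.25) < 1e-12);
        return 0;
    }

The open part, to me, is how to pick the right segment [xl, xr] for each x in the first place.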


Solution

  • To mimic the behavior of numpy.interp requires several steps. We'll make at least one simplifying assumption: numpy.interp expects your xp array to be increasing (we could probably also say "sorted"); otherwise its documentation specifically mentions the need for an (internal) sort. We'll skip that case and assume your xp array is increasing, as you have shown here.

    The numpy function also allows the x array to be more-or-less arbitrary, from what I can see.

    In order to do a proper interpolation, we must find the "segment" of xp that each x value belongs to. The only way I can think of to do that is to perform a binary search. (Also note that thrust has convenient binary searches.)

    The process then would be:

    1. using a binary search per element in x, find the corresponding "segment" (i.e. endpoints) in xp
    2. use the equation of a line (y=mx+b), based on the identified segment, to compute the interpolated value between the endpoints
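    These two steps can be sketched host-side first (illustrative C++ only, not the CUDA kernel; std::lower_bound has the same semantics as the device binary search used below, i.e. it returns the first index with xp[idx] >= x, and like the kernel this sketch assumes x lies inside [xp[0], xp[last])):

    #include <algorithm>
    #include <cassert>
    #include <cmath>
    #include <vector>

    // Step 1: binary search for the segment of xp containing x.
    // Step 2: equation of the line through the segment endpoints.
    double interp1(const std::vector<double>& xp,
                   const std::vector<double>& yp, double x) {
        auto it = std::lower_bound(xp.begin(), xp.end(), x);
        std::size_t idx = it - xp.begin();   // first index with xp[idx] >= x
        double xl = xp[idx - 1], xr = xp[idx];
        double yl = yp[idx - 1], yr = yp[idx];
        double m = (yr - yl) / (xr - xl);    // slope
        return yl + m * (x - xl);            // y = m*(x - xl) + b
    }

    int main() {
        std::vector<double> xp = {1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 7.0,
                                  8.0, 9.0, 10.0, 14.0, 17.0, 18.0, 20.0};
        std::vector<double> yp(xp.size());
        for (std::size_t i = 0; i < xp.size(); i++) yp[i] = xp[i] * xp[i];
        assert(std::fabs(interp1(xp, yp, 3.0) - 9.25) < 1e-12);
        assert(std::fabs(interp1(xp, yp, 5.0) - 26.0) < 1e-12);
        return 0;
    }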

    Here's an example:

    $ cat t40.cu
    #include <iostream>
    
    typedef int lt;
    
    template <typename T>
    __device__ void bsearch_range(const T *a, const T key, const lt len_a, lt *idx){
      // lower-bound binary search: writes the first index with a[*idx] >= key
      lt lower = 0;
      lt upper = len_a;
      lt midpt;
      while (lower < upper){
        midpt = (lower + upper)>>1;
        if (a[midpt] < key) lower = midpt +1;
        else upper = midpt;
        }
      *idx = lower;
      return;
      }
    
    
    template <typename T>
    __global__ void my_interp(const T *xp, const T *yp, const lt xp_len, const lt x_len, const T *x, T *y){
    
      for (lt i = threadIdx.x+blockDim.x*blockIdx.x; i < x_len; i+=gridDim.x*blockDim.x){
        T val = x[i];
        if ((val >= xp[0]) && (val < xp[xp_len - 1])){
          lt idx;
          bsearch_range(xp, val, xp_len, &idx);
          T xlv = xp[idx-1];
          T xrv = xp[idx];
          T ylv = yp[idx-1];
          T yrv = yp[idx];
         // y  =      m                *   x       + b
          y[i] = ((yrv-ylv)/(xrv-xlv)) * (val-xlv) + ylv;
        }
      }
    }
    
    typedef float mt;
    const int nTPB = 256;
    int main(){
    
      mt xp[] = {1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 7.0, 8.0, 9.0, 10.0, 14.0, 17.0, 18.0, 20.0};
      lt xp_len = sizeof(xp)/sizeof(xp[0]);
      mt *yp = new mt[xp_len];
      for (lt i = 0; i < xp_len; i++) yp[i] = xp[i]*xp[i];
      mt x[] = {3,5};
      lt x_len = sizeof(x)/sizeof(x[0]);
      mt *y = new mt[x_len];
      mt *d_xp, *d_x, *d_yp, *d_y;
      cudaMalloc(&d_xp, xp_len*sizeof(xp[0]));
      cudaMalloc(&d_yp, xp_len*sizeof(yp[0]));
      cudaMalloc(&d_x,   x_len*sizeof( x[0]));
      cudaMalloc(&d_y,   x_len*sizeof( y[0]));
      cudaMemcpy(d_xp, xp, xp_len*sizeof(xp[0]), cudaMemcpyHostToDevice);
      cudaMemcpy(d_yp, yp, xp_len*sizeof(yp[0]), cudaMemcpyHostToDevice);
      cudaMemcpy(d_x, x, x_len*sizeof(x[0]), cudaMemcpyHostToDevice);
      my_interp<<<(x_len+nTPB-1)/nTPB, nTPB>>>(d_xp, d_yp, xp_len, x_len, d_x, d_y);
      cudaMemcpy(y, d_y, x_len*sizeof(y[0]), cudaMemcpyDeviceToHost);
      for (lt i = 0; i < x_len; i++) std::cout << y[i] << ",";
      std::cout << std::endl;
    }
    $ nvcc -o t40 t40.cu
    $ cuda-memcheck ./t40
    ========= CUDA-MEMCHECK
    9.25,26,
    ========= ERROR SUMMARY: 0 errors
    $
    

    I'm not suggesting the above code is defect-free or suitable for any particular purpose. My objective here is to demonstrate a method, not offer a fully tested code. In particular, edge cases probably need to be tested carefully (values at the edge of the overall range, or outside the range, for example).
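    As one example of such an edge case: numpy.interp by default clamps out-of-range queries, returning yp[0] for x below xp[0] and yp[last] for x above xp[last], whereas the kernel above simply leaves y[i] unwritten for those values. A host-side sketch of one way to match numpy's default (illustrative only; the same branches would drop into the kernel loop):

    #include <algorithm>
    #include <cassert>
    #include <cmath>
    #include <vector>

    // Interpolate with numpy.interp-style clamping at the range edges.
    double interp_clamped(const std::vector<double>& xp,
                          const std::vector<double>& yp, double x) {
        if (x <= xp.front()) return yp.front();  // left clamp
        if (x >= xp.back())  return yp.back();   // right clamp
        auto it = std::lower_bound(xp.begin(), xp.end(), x);
        std::size_t idx = it - xp.begin();
        double m = (yp[idx] - yp[idx-1]) / (xp[idx] - xp[idx-1]);
        return yp[idx-1] + m * (x - xp[idx-1]);
    }

    int main() {
        std::vector<double> xp = {1.0, 2.0, 3.0};
        std::vector<double> yp = {1.0, 4.0, 9.0};
        assert(interp_clamped(xp, yp, 0.5) == 1.0);   // below range -> yp[0]
        assert(interp_clamped(xp, yp, 25.0) == 9.0);  // above range -> yp[2]
        assert(std::fabs(interp_clamped(xp, yp, 2.5) - 6.5) < 1e-12);
        return 0;
    }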