Search code examples
c++cudathrust

What is the most efficient way to apply a functor to a subset of a device array?


I am rewriting a library that performs calculations and other operations on data that is stored in contiguous chunks of memory so that it can work on GPUs using the CUDA framework. The data represents information that lives on a 4-dimensional grid. The total size of the grid can range from 1000's to millions of grid points. Along each direction, the grid may have as little as 8 or as much as 100's of points. My question is about what is the best way to implement operations on a subset of the grid. For example, suppose that my grid is [0,nx)x[0,ny)x[0,nz)x[0,nq), and I want to implement a transformation that multiplies all the points whose indexes belong to [1,nx-1)x[1,ny-1)x[1,nz-1)x[0,nq-1) by minus 1.

Right now, what I do is via nested loops. This is a skeleton of code

{ 
int nx,ny,nz,nq;
nx=10,ny=10,nz=10,nq=10;
typedef thrust::device_vector<double> Array;
Array A(nx*ny*nz*nq);
thrust::fill(A.begin(), A.end(), (double) 1);

for (auto q=1; q<nq-1; ++q){
for (auto k=1; k<nz-1; ++k){
for (auto j=1; j<ny-1; ++j){
int offset1=+1+j*nx+k*nx*ny+q*nx*ny*nz;
int offset2=offset1+nx-2;
thrust::transform(A.begin()+offset1, 
                  A.begin()+offset2, 
                  thrust::negate<double>());
      }
    }
  }
}

However, I wonder if this is the most efficient way, because it seems to me that in this case at most only nx-2 threads can be run simultaneously. So I was thinking that perhaps a better way would be to generate a sequence iterator (returning the linear position along the array), zip it to the array with a zip iterator, and defining a functor that examines the second element of the tuple (the position value) and if that value falls into the accepted range, modify the first element of the tuple. However, there may be a better way to do that. I am new to CUDA, and to make matter worse I really cut my teeth with Fortran, so it is hard for me to think outside the for-loop box...


Solution

  • I'm not sure what is the most efficient way. I can suggest what I think will be more efficient than your skeleton code.

    Your proposal in the text is headed in the right direction. Rather than use a set of nested for-loops that will iterate potentially quite a few times, we should seek to get everything done in one thrust call. But we still need to have that one thrust call only modify the array values at the indices within the "cubic" volume to be operated on.

    We don't want to use a method involving testing of a generated index against the valid index volume, however, as you seem to be suggesting. This would require us to launch a grid as large as our array, even if we only wanted to modify a small volume of it.

    Instead, we launch an operation that is just large enough to cover the needed number of elements to modify, and we create a functor which does a linear index -> 4D index -> adjusted linear index conversion. That functor then operates within a transform iterator to convert an ordinary linear sequence starting at 0, 1, 2, etc. to a sequence that starts and stays within the volume to be modified. A permutation iterator is then used with this modified sequence to select the values of the array to modify.

    Here's an example showing the difference in timing for your nested loop method (1) vs. mine (2) for an array of 64x64x64x64 and a modified volume of 62x62x62x62:

    $ cat t39.cu
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/iterator/permutation_iterator.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/functional.h>
    #include <thrust/equal.h>
    #include <cassert>
    #include <iostream>
    
    struct my_idx
    {
      int nx, ny, nz, nq, lx, ly, lz, lq, dx, dy, dz, dq;
      my_idx(int _nx, int _ny, int _nz, int _nq, int _lx, int _ly, int _lz, int _lq, int _hx, int _hy, int _hz, int _hq) {
        nx = _nx;
        ny = _ny;
        nz = _nz;
        nq = _nq;
        lx = _lx;
        ly = _ly;
        lz = _lz;
        lq = _lq;
        dx = _hx - lx;
        dy = _hy - ly;
        dz = _hz - lz;
        dq = _hq - lq;
        // could do a lot of assert checking here
      }
    
      __host__ __device__
      int operator()(int idx){
        int rx = idx / dx;
        int ix = idx - (rx * dx);
        int ry = rx / dy;
        int iy = rx - (ry * dy);
        int rz = ry / dz;
        int iz = ry - (rz * dz);
        int rq = rz / dq;
        int iq = rz - (rq * dq);
        return (((iq+lq)*nz+iz+lz)*ny+iy+ly)*nx+ix+lx;
      }
    };
    
    #include <time.h>
    #include <sys/time.h>
    #define USECPSEC 1000000ULL
    
    unsigned long long dtime_usec(unsigned long long start){
    
      timeval tv;
      gettimeofday(&tv, 0);
      return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
    }
    
    
    int main()
    {
      int nx,ny,nz,nq,lx,ly,lz,lq,hx,hy,hz,hq;
      nx=64,ny=64,nz=64,nq=64;
      lx=1,ly=1,lz=1,lq=1;
      hx=nx-1,hy=ny-1,hz=nz-1,hq=nq-1;
      thrust::device_vector<double> A(nx*ny*nz*nq);
      thrust::device_vector<double> B(nx*ny*nz*nq);
      thrust::fill(A.begin(), A.end(), (double) 1);
      thrust::fill(B.begin(), B.end(), (double) 1);
      // method 1
      unsigned long long m1_time = dtime_usec(0);
      for (auto q=lq; q<hq; ++q){
        for (auto k=lz; k<hz; ++k){
          for (auto j=ly; j<hy; ++j){
            int offset1=lx+j*nx+k*nx*ny+q*nx*ny*nz;
            int offset2=offset1+(hx-lx);
            thrust::transform(A.begin()+offset1,
                      A.begin()+offset2, A.begin()+offset1,
                      thrust::negate<double>());
          }
        }
      }
      cudaDeviceSynchronize();
      m1_time = dtime_usec(m1_time);
    
      // method 2
      unsigned long long m2_time = dtime_usec(0);
      auto p = thrust::make_permutation_iterator(B.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), my_idx(nx, ny, nz, nq, lx, ly, lz, lq, hx, hy, hz, hq)));
      thrust::transform(p, p+(hx-lx)*(hy-ly)*(hz-lz)*(hq-lq), p, thrust::negate<double>());
      cudaDeviceSynchronize();
      m2_time = dtime_usec(m2_time);
      if (thrust::equal(A.begin(), A.end(), B.begin()))
        std::cout << "method 1 time: " << m1_time/(float)USECPSEC << "s method 2 time: " << m2_time/(float)USECPSEC << "s" << std::endl;
      else
        std::cout << "mismatch error" << std::endl;
    }
    $ nvcc -std=c++11 t39.cu -o t39
    $ ./t39
    method 1 time: 1.6005s method 2 time: 0.013182s
    $