Tags: cuda, gpgpu, reduction, cufft, toeplitz

Blockwise/Strided reduction using CUDA


TLDR: I am trying to write a GPU code that computes a blockwise reduction on an array. The input looks like [block_0, trash_0, block_1, trash_1, ..., block_n, trash_n], and I want to compute block_0 + block_1 + ... + block_n. Relevant sizes: each block (and trash-block) has size ~1,000 and there are ~10,000-100,000 blocks.

Longer Explanation/Context: I am working on a GPU code that computes the matrix-vector product of a matrix whose blocks are lower triangular and Toeplitz. Here is what I have so far:

Consider a block-row of the matrix T = [T_{i0}, T_{i1}, ..., T_{in}]. Each T_{ij} is lower triangular and Toeplitz, so, for example, it will look like [[a, 0, 0], [b, a, 0], [c, b, a]]. Consider also the vector v = [v_0, v_1, ..., v_n] in block form, so, for example, v_j will look like [x, y, z]. To get the i'th block of the output, we need to compute the sum T_{i0}v_0 + T_{i1}v_1 + ... + T_{in}v_n. Since each T_{ij} is Toeplitz, we compute the matvecs T_{ij}v_j as follows:

  1. Pad the first column of T_{ij} and v_j with zeros to twice the length. Continuing the 3x3 example from above, this looks like T_{ij} -> [a, b, c, 0, 0, 0] and v_j -> [x, y, z, 0, 0, 0]. We do this for each block j = 0...n and join the padded vectors together. After this step, we will have the arrays T_i_pad -> [a0, b0, c0, 0, 0, 0, a1, b1, c1, 0, 0, 0, ..., an, bn, cn, 0, 0, 0] and v_pad -> [x0, y0, z0, 0, 0, 0, x1, y1, z1, 0, 0, 0, ..., xn, yn, zn, 0, 0, 0].

  2. Take the FFT of the T_i_pad and v_pad arrays. For this, I am using the cuFFT library with the cufftPlanMany() function to do the FFTs in a batched way (a minimal plan sketch follows this list). I only deal with real inputs (double precision), so I use the D2Z option.

  3. Do a pointwise multiply of the FFT'd arrays and scale by 1/sqrt(length_of_each_padded_block) (length_of_each_padded_block = 6 in this example).

  4. Take an inverse batched FFT (Z2D). The result will be an array the same size as the initial v_pad array; it will contain the block matvecs I care about as well as some other terms that don't matter, arranged as follows: result = [T_{i0}v_0, ***, T_{i1}v_1, ***, ..., T_{in}v_n, ***]. In the 3x3 example, each T_{ij}v_j will have 3 elements, and so will each ***.
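For reference, here is a minimal sketch of how the batched D2Z plan in step 2 could be set up, assuming the packed layout of T_i_pad and v_pad from step 1. The names make_padded_block_plan, block_size, num_blocks, d_pad, and d_hat are illustrative and not taken from my actual code.

#include <cufft.h>

// Minimal sketch: one batched 1D D2Z plan covering all padded blocks at once.
cufftHandle make_padded_block_plan(int block_size, int num_blocks)
{
    cufftHandle plan;
    int N = 2 * block_size;                // padded block length (6 in the 3x3 example)
    int n[1]       = { N };
    int inembed[1] = { N };
    int onembed[1] = { N / 2 + 1 };
    cufftPlanMany(&plan, 1, n,
                  inembed, 1, N,           // real input blocks are N doubles apart
                  onembed, 1, N / 2 + 1,   // complex output blocks are N/2+1 elements apart
                  CUFFT_D2Z, num_blocks);
    return plan;
}
// usage: cufftExecD2Z(plan, d_pad, d_hat); the same plan serves both T_i_pad and v_pad.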

All of this I have already done, and it is working correctly. The next step is to do a reduction over the result array so that I get the output T_{i0}v_0 + T_{i1}v_1 + ... + T_{in}v_n. This is where I am not sure what the best option is. Also, while I used a 3x3 example here, the scale I actually want to run at is blocks of size O(10^3), with O(10^4) to O(10^5) such blocks.

The way I am currently implementing the reduction is with CUDA cooperative groups. I am using the standard binary-tree reduction algorithm at the threadblock level rather than at the thread level, as is more common. For this example, let us assume there are 16 blocks that need to be reduced, each with 3 elements as above. The algorithm I have now proceeds as follows:

  1. Launch a grid of 8 threadblocks. At the first stage of the binary tree, each threadblock reads 2 of the blocks of result, computes their sum, and writes it to the left position in global memory. For example, threadblock 0 adds T_{i0}v_0 + T_{i1}v_1 and writes the result over T_{i0}v_0.

  2. Synchronize across blocks between each stage of the binary tree using the cooperative groups grid-wide sync (grid.sync()). Continue the binary-tree reduction pattern.

  3. At the end, the result will be stored where T_{i0}v_0 originally was.

Here is the kernel code and the host code that calls it. Note that there is some extra complexity to ensure that no more than the maximum number of blocks that can be resident at one time are launched, and to handle non-power-of-2 cases. I am also using the vector data type double2, since I read in several places that it is preferable.

#include <stdio.h>
#include <iostream>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cmath>
#include <cooperative_groups.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
#define gpuErrchk(x) x
const int MAX_BLOCK_SIZE=1024;
static __device__ void VectorAdd(double2 *a, double2 *b, int size)
{
  double2 tmp1, tmp2;
  for (int i = threadIdx.x; i < size; i += blockDim.x)
  {
    tmp1 = a[i];
    tmp2 = b[i];
    a[i] = {tmp1.x + tmp2.x, tmp1.y + tmp2.y};
  }
}

static __global__ void StridedVectorReduceKernel(double *d_in, int length, int num_vectors, int stride)
{

  cooperative_groups::grid_group grid = cooperative_groups::this_grid();
  int bid = blockIdx.x;
  int offset = 1;
  double2 * d_in2 = reinterpret_cast<double2 *>(d_in);
  for (int s = gridDim.x; s > 1; s = (s+1)/2)
  {
    grid.sync();
    int index = 2 * offset * bid;
    int index2 = index + offset;
    if (bid < s && index2 < num_vectors)
    {
      VectorAdd(d_in2 + (index * length * stride)/2, d_in2 + (index2 * length * stride)/2, (length + 1) / 2);
    }
    offset *= 2;
  }
  grid.sync();
  if (bid == 0 && offset < num_vectors)
  {
    VectorAdd(d_in2, d_in2 + (offset * length * stride)/2, (length + 1) / 2);
  }
}

__global__ void  SVR2Kernel(volatile double *d_in, int num_vectors, int length, int stride, int *bc)
{
  __shared__ int my_block;
  int t = threadIdx.x;
  double val;
  for (int j = t; j < length; j += blockDim.x){
    val = 0;
    for (int i = blockIdx.x; i < num_vectors; i+=gridDim.x)
        val += d_in[i*length+j];
    d_in[blockIdx.x*length+j] = val;
  }
  __threadfence();
  // use block draining
  if (t == 0) my_block = atomicAdd(bc, 1);
  __syncthreads();
  if (my_block == gridDim.x-1){
    // I am last block
    for (int j = t; j < length; j += blockDim.x){
        val = d_in[j];
        for (int i = 1; i < gridDim.x; i++) val += d_in[i*length+j];
        d_in[j] = val;
    }
  }

}

__global__ void  SVR3Kernel(volatile double2 *d_in, int num_vectors, int length, int stride, int *bc)
{
  __shared__ int my_block;
  int t = threadIdx.x;
  double2 val;
  for (int j = t; j < length; j += blockDim.x){
    val = {0, 0};
    for (int i = blockIdx.x; i < num_vectors; i+=gridDim.x)
    {   
        val.x += d_in[i*length + j].x;
        val.y += d_in[i*length + j].y;
    }
    d_in[blockIdx.x*length+j].x = val.x;
    d_in[blockIdx.x*length+j].y = val.y;
  }
  __threadfence();
  // use block draining
  if (t == 0) my_block = atomicAdd(bc, 1);
  __syncthreads();
  if (my_block == gridDim.x-1){
    // I am last block
    for (int j = t; j < length; j += blockDim.x){
        val.x = d_in[j].x;
        val.y = d_in[j].y;
        for (int i = 1; i < gridDim.x; i++) {
            val.x += d_in[i*length+j].x;
            val.y += d_in[i*length+j].y;
        }
        d_in[j].x = val.x;
        d_in[j].y = val.y;
    }
  }

}



void StridedVectorReduce(double * d_in, int num_vectors, int length, int stride, int device)
{
  int numBlocksPerSm = 0;
  cudaDeviceProp deviceProp;
  gpuErrchk(cudaGetDeviceProperties(&deviceProp, device));
  gpuErrchk(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, StridedVectorReduceKernel, MAX_BLOCK_SIZE, 0));

  int MaxNumBlocks = numBlocksPerSm * deviceProp.multiProcessorCount;
  int num_threads = std::min((length + 1) / 2, MAX_BLOCK_SIZE);
  dim3 dimBlock(num_threads, 1, 1);

  int chunks = (num_vectors + 2*MaxNumBlocks -1)/(2*MaxNumBlocks);
  double *start;
  int num;
  for (int i = 0; i < chunks - 1; i++)
  {
    num = 2 * MaxNumBlocks;
    start = d_in + i * stride * num * length;
    void *kernelArgs[] = {(void *)&start, (void *)&length, (void *)&num, (void *)&stride};
    dim3 dimGrid(MaxNumBlocks, 1, 1);
    gpuErrchk(cudaLaunchCooperativeKernel((void *)StridedVectorReduceKernel, dimGrid, dimBlock, kernelArgs));
  }
  num = num_vectors - 2 * (chunks - 1) * MaxNumBlocks;
  start = d_in + (chunks - 1) * 2 * stride * MaxNumBlocks * length;
  void *kernelArgs[] = {(void *)&start, (void *)&length, (void *)&num, (void *)&stride};
  dim3 dimGrid((num+1)/2, 1, 1);
  gpuErrchk(cudaLaunchCooperativeKernel((void *)StridedVectorReduceKernel, dimGrid, dimBlock, kernelArgs));

  if (chunks == 1)
    return;
  StridedVectorReduce(d_in, chunks, length, stride * 2 * MaxNumBlocks, device);
}
void SVR2(double * d_in, int num_vectors, int length, int stride, int device, int *bc)
{
  int numBlocksPerSm = 0;
  cudaDeviceProp deviceProp;
  gpuErrchk(cudaGetDeviceProperties(&deviceProp, device));
  gpuErrchk(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, StridedVectorReduceKernel, MAX_BLOCK_SIZE, 0));

  int MaxNumBlocks = numBlocksPerSm * deviceProp.multiProcessorCount;
  int num_threads = std::min((length + 1) / 2, MAX_BLOCK_SIZE);
  dim3 dimBlock(num_threads, 1, 1);
  dim3 dimGrid(MaxNumBlocks, 1, 1);
  SVR2Kernel<<<dimGrid, dimBlock>>>(d_in, num_vectors, length, stride, bc);
}

void SVR3(double * d_in, int num_vectors, int length, int stride, int device, int *bc)
{
  int numBlocksPerSm = 0;
  cudaDeviceProp deviceProp;
  gpuErrchk(cudaGetDeviceProperties(&deviceProp, device));
  gpuErrchk(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, StridedVectorReduceKernel, MAX_BLOCK_SIZE, 0));

  int MaxNumBlocks = numBlocksPerSm * deviceProp.multiProcessorCount;
  int num_threads = std::min((length + 3) / 4, MAX_BLOCK_SIZE);
  dim3 dimBlock(num_threads, 1, 1);
  dim3 dimGrid(MaxNumBlocks, 1, 1);

  SVR3Kernel<<<dimGrid, dimBlock>>>(reinterpret_cast<double2 *>(d_in), num_vectors, length/2, stride, bc);
}

int main(int argc, char **argv)
{
    int device = 0;
    gpuErrchk(cudaSetDevice(device));
    int num_vectors = atoi(argv[1]);
    int length = atoi(argv[2]);
    double *r = (double *)malloc(sizeof(double)*length);
    double *h_in = (double *)malloc(sizeof(double) * length * num_vectors);
    for (int k = 0; k < num_vectors; k++)
    {
    for (int i = 0; i < length / 2; i++)
        h_in[k * length + i] = i+1;
    for (int i = length / 2; i < length; i++)
        h_in[k * length + i] = 0.0;
    }

    double *d_in;
    gpuErrchk(cudaMalloc((void **)&d_in, sizeof(double) * length * num_vectors));
    gpuErrchk(cudaMemcpy(d_in, h_in, sizeof(double) * length * num_vectors, cudaMemcpyHostToDevice));
    unsigned long long dt = dtime_usec(0);
    StridedVectorReduce(d_in, num_vectors, length / 2, 2, device);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    gpuErrchk(cudaMemcpy(r, d_in, sizeof(double) * length, cudaMemcpyDeviceToHost));
    int start = (length / 2 < 100) ? 0 : length / 2 - 10;
    for (int i = start; i < length / 2; i++)
      printf("h_out: %i %f \n", i, r[i]);
    // output should be [1, 2, 3, ... , length / 2] * num_vectors
    printf("duration 1: %f seconds\n", dt/(float)USECPSEC);
    gpuErrchk(cudaMemcpy(d_in, h_in, sizeof(double) * length * num_vectors, cudaMemcpyHostToDevice));
    int *bc;
    cudaMalloc(&bc, sizeof(int));
    cudaMemset(bc, 0, sizeof(int));
    dt = dtime_usec(0);
    SVR2(d_in, num_vectors, length, 2, device, bc);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    gpuErrchk(cudaMemcpy(r, d_in, sizeof(double) * length, cudaMemcpyDeviceToHost));
    start = (length / 2 < 100) ? 0 : length / 2 - 10;
    for (int i = start; i < length / 2; i++)
      printf("h_out: %i %f \n", i, r[i]);
    // output should be [1, 2, 3, ... , length / 2] * num_vectors
    printf("duration 2 (double): %f seconds\n", dt/(float)USECPSEC);
    gpuErrchk(cudaMemcpy(d_in, h_in, sizeof(double) * length * num_vectors, cudaMemcpyHostToDevice));
    cudaMemset(bc, 0, sizeof(int));  // reuse the counter allocated above; just reset it
    dt = dtime_usec(0);
    SVR3(d_in, num_vectors, length, 2, device, bc);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    gpuErrchk(cudaMemcpy(r, d_in, sizeof(double) * length, cudaMemcpyDeviceToHost));
    start = (length / 2 < 100) ? 0 : length / 2 - 10;
    for (int i = start; i < length / 2; i++)
      printf("h_out: %i %f \n", i, r[i]);
    // output should be [1, 2, 3, ... , length / 2] * num_vectors
    printf("duration 3 (double2): %f seconds\n", dt/(float)USECPSEC);
    free(h_in);
    gpuErrchk(cudaFree(d_in));

    return 0;
}

// output with num_vectors = 100000 and length = 5000
h_out: 2490 249100000.000000 
h_out: 2491 249200000.000000 
h_out: 2492 249300000.000000 
h_out: 2493 249400000.000000 
h_out: 2494 249500000.000000 
h_out: 2495 249600000.000000 
h_out: 2496 249700000.000000 
h_out: 2497 249800000.000000 
h_out: 2498 249900000.000000 
h_out: 2499 250000000.000000 
duration 1: 0.009309 seconds
h_out: 2490 249100000.000000 
h_out: 2491 249200000.000000 
h_out: 2492 249300000.000000 
h_out: 2493 249400000.000000 
h_out: 2494 249500000.000000 
h_out: 2495 249600000.000000 
h_out: 2496 249700000.000000 
h_out: 2497 249800000.000000 
h_out: 2498 249900000.000000 
h_out: 2499 250000000.000000 
duration 2 (double): 0.003667 seconds
h_out: 2490 249100000.000000 
h_out: 2491 249200000.000000 
h_out: 2492 249300000.000000 
h_out: 2493 249400000.000000 
h_out: 2494 249500000.000000 
h_out: 2495 249600000.000000 
h_out: 2496 249700000.000000 
h_out: 2497 249800000.000000 
h_out: 2498 249900000.000000 
h_out: 2499 250000000.000000 
duration 3 (double2): 0.004646 seconds

This algorithm does work correctly and computes the desired reduction. However, I have no idea whether it is the right approach for this problem. I read about the Thrust strided iterators in some other posts, but that approach seems to be at least 2x slower than what I have here, since I would have to call thrust::reduce() once per element of the block (so O(10^3) calls; that pattern is sketched below). I also don't know if there is a way to have the IFFT output organized so that the reduction part becomes easier.
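For concreteness, here is roughly what that per-element Thrust pattern looks like, as a hedged sketch of the slower variant I am describing (names like StrideAt and reduce_element are illustrative, not the code I actually benchmarked):

#include <thrust/device_ptr.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/reduce.h>

// Index map: element j of vector i lives at i*block_stride + j.
struct StrideAt {
  long long block_stride, j;
  __host__ __device__ long long operator()(long long i) const { return i * block_stride + j; }
};

// One thrust::reduce per block element -> O(10^3) separate reductions overall.
double reduce_element(const double *d_in, int num_vectors, long long block_stride, int j)
{
  auto idx = thrust::make_transform_iterator(thrust::counting_iterator<long long>(0),
                                             StrideAt{block_stride, j});
  auto it  = thrust::make_permutation_iterator(thrust::device_pointer_cast(d_in), idx);
  return thrust::reduce(it, it + num_vectors, 0.0);
}

Each call is a full device reduction over widely strided elements, which is why the launch count (one per element) seems to dominate at these sizes.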

Any guidance on this would be greatly appreciated! This is my first foray into GPU programming and my first ever question on Stack Overflow, so I apologize in advance for any mistakes on my part.

Edit: I updated the code to include a full test case. I realize I was not clear about how I would call the StridedVectorReduce function. As it stands, length refers to the size of each padded block, num_vectors refers to the number of (padded) blocks, and stride refers to the step, in units of length, between the blocks we want to sum. So the function is called as StridedVectorReduce(d_in, num_vectors, length / 2, 2, 0), where length/2 is the size of the unpadded block, and there is a step of 2 unpadded block sizes between the blocks we want to sum. I haven't included the gpuErrchk() code so as not to make the post too long, but it should work without it.

This code should have no errors under compute-sanitizer. I have also removed the std::ceil() calls and replaced them with integer ceiling division. I will try the grid-stride loop in the kernel next, though I am not sure how to use only one grid sync in the kernel (one possible shape is sketched below); I thought we needed to sync between each level of the binary tree.
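For what it's worth, here is a hedged sketch of what a single-grid-sync variant could look like (my own guess at the structure, not code from the answer): each block accumulates its share of vectors with a grid-stride loop, a single grid.sync() separates the two phases, and block 0 then folds the gridDim.x partial vectors together. It still requires a cooperative launch, and it assumes gridDim.x <= num_vectors as in the kernels above.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

static __global__ void OneSyncReduceKernel(double *d_in, int num_vectors, int length)
{
  cg::grid_group grid = cg::this_grid();
  // Phase 1: block b sums vectors b, b+gridDim.x, ... into vector b (threads stride over j).
  for (int j = threadIdx.x; j < length; j += blockDim.x) {
    double val = 0.0;
    for (int i = blockIdx.x; i < num_vectors; i += gridDim.x)
      val += d_in[i * length + j];
    d_in[blockIdx.x * length + j] = val;
  }
  grid.sync();  // the single grid-wide synchronization
  // Phase 2: block 0 folds the gridDim.x partial vectors into vector 0.
  if (blockIdx.x == 0)
    for (int j = threadIdx.x; j < length; j += blockDim.x) {
      double val = d_in[j];
      for (int i = 1; i < gridDim.x; i++)
        val += d_in[i * length + j];
      d_in[j] = val;
    }
}
// Launched with cudaLaunchCooperativeKernel, like StridedVectorReduceKernel above.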

Edit 2: Using the code from Robert Crovella's answer, I have updated the code above to include my implementation of their suggestion to use horizontal and vertical striding. There is no need to handle any strides other than 2 with the new algorithm, so that parameter is not actually used in the code. I also tried making a double2 version to see if it would do any better. It ends up doing a bit worse than the pure double version. I'm not sure if that is because lines like val.x = d_in[j].x; val.y = d_in[j].y; cause multiple memory reads. I wanted to create a temporary variable and do something like double2 tmp = d_in[j] instead, but that caused compiler errors because d_in is volatile and tmp is not (a possible workaround is sketched below). Anyway, thanks a lot for the help, and sorry for the incompleteness of my previous code. This version should compile directly as written.
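On the double2 temporary: one possible workaround, which is an assumption on my part rather than something from the answer, is to load through the cache-volatile intrinsic __ldcv(). It issues a volatile-style load (ld.global.cv) but returns a plain double2, so a non-volatile temporary works and each element is still a single 16-byte read:

// Hedged sketch of the workaround; requires compute capability 5.0+.
__device__ double2 load_volatile2(volatile double2 *p, size_t idx)
{
  // const_cast may adjust cv-qualifiers: drop volatile, add const for the intrinsic.
  return __ldcv(const_cast<const double2 *>(p) + idx);
}
// e.g. inside SVR3Kernel:
//   double2 tmp = load_volatile2(d_in, (size_t)i * length + j);
//   val.x += tmp.x;  val.y += tmp.y;

I haven't measured whether this actually helps; the extra reads may not be the bottleneck.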


Solution

  • Here's an approach that seems to be faster for me. I'm using a grid-striding approach to allow each block to accumulate an arbitrary number of vector results, then a block-draining strategy to perform the final reduction. In so doing I get rid of the need for the grid sync mechanism as well as the need for multiple kernel launches.

    The benefit of this approach will vary based on the size of your GPU as well as the size of your dataset, but in my case (V100) seems to give about a 3x speedup over your method. A smaller GPU and/or larger data set than my test case would probably show increasing benefit. (Indeed, when I increase the num_vectors in my test case to 100,000, I get about a 6x ratio instead of 3x).

    I have not accounted for the case in my kernel where the vector length (i.e. the values to actually be summed - you are using length inconsistently in your code) is greater than 1024, but this could be accomplished by vertical striding in addition to horizontal striding - the effect should be linear on increasing vector size. Furthermore, I have not accounted for the use of stride, but your test case anyway seems to have stride more-or-less hard-coded to 2 (at least in the data setup) and again it should just be a modification to indexing. Therefore this should be considered a roadmap or demonstrator. The indexing is fairly simple in my kernel, benefitted by the fact that we don't have to deal with all the chunking of data.

    Here is a demonstrator:

    $ cat t2255.cu
    #include <stdio.h>
    #include <iostream>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cmath>
    #include <cooperative_groups.h>
    #include <time.h>
    #include <sys/time.h>
    #define USECPSEC 1000000ULL
    
    unsigned long long dtime_usec(unsigned long long start=0){
    
      timeval tv;
      gettimeofday(&tv, 0);
      return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
    }
    #define gpuErrchk(x) x
    const int MAX_BLOCK_SIZE=1024;
    static __device__ void VectorAdd(double2 *a, double2 *b, int size)
    {
      double2 tmp1, tmp2;
      for (int i = threadIdx.x; i < size; i += blockDim.x)
      {
        tmp1 = a[i];
        tmp2 = b[i];
        a[i] = {tmp1.x + tmp2.x, tmp1.y + tmp2.y};
      }
    }
    
    static __global__ void StridedVectorReduceKernel(double *d_in, int length, int num_vectors, int stride)
    {
    
      cooperative_groups::grid_group grid = cooperative_groups::this_grid();
      int bid = blockIdx.x;
      int offset = 1;
      double2 * d_in2 = reinterpret_cast<double2 *>(d_in);
      for (int s = gridDim.x; s > 1; s = (s+1)/2)
      {
        grid.sync();
        int index = 2 * offset * bid;
        int index2 = index + offset;
        if (bid < s && index2 < num_vectors)
        {
          VectorAdd(d_in2 + (index * length * stride)/2, d_in2 + (index2 * length * stride)/2, (length + 1) / 2);
        }
        offset *= 2;
      }
      grid.sync();
      if (bid == 0 && offset < num_vectors)
      {
        VectorAdd(d_in2, d_in2 + (offset * length * stride)/2, (length + 1) / 2);
      }
    }
    
    __global__ void  SVR2Kernel(volatile double *d_in, int num_vectors, int length, int stride, int *bc)
    {
      __shared__ int my_block;
      int t = threadIdx.x;
      double val = 0;
      for (int i = blockIdx.x; i < num_vectors; i+=gridDim.x)
        val += d_in[i*length+t];
      d_in[blockIdx.x*length+t] = val;
      __threadfence();
      // use block draining
      if (t == 0) my_block = atomicAdd(bc, 1);
      __syncthreads();
      if (my_block == gridDim.x-1){
        // I am last block
        val = d_in[t];
        for (int i = 1; i < gridDim.x; i++) val += d_in[i*length+t];
        d_in[t] = val;
      }
    
    }
    void StridedVectorReduce(double * d_in, int num_vectors, int length, int stride, int device)
    {
      int numBlocksPerSm = 0;
      cudaDeviceProp deviceProp;
      gpuErrchk(cudaGetDeviceProperties(&deviceProp, device));
      gpuErrchk(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, StridedVectorReduceKernel, MAX_BLOCK_SIZE, 0));
    
      int MaxNumBlocks = numBlocksPerSm * deviceProp.multiProcessorCount;
      int num_threads = std::min((length + 1) / 2, MAX_BLOCK_SIZE);
      dim3 dimBlock(num_threads, 1, 1);
    
      int chunks = (num_vectors + 2*MaxNumBlocks -1)/(2*MaxNumBlocks);
      double *start;
      int num;
      for (int i = 0; i < chunks - 1; i++)
      {
        num = 2 * MaxNumBlocks;
        start = d_in + i * stride * num * length;
        void *kernelArgs[] = {(void *)&start, (void *)&length, (void *)&num, (void *)&stride};
        dim3 dimGrid(MaxNumBlocks, 1, 1);
        gpuErrchk(cudaLaunchCooperativeKernel((void *)StridedVectorReduceKernel, dimGrid, dimBlock, kernelArgs));
      }
      num = num_vectors - 2 * (chunks - 1) * MaxNumBlocks;
      start = d_in + (chunks - 1) * 2 * stride * MaxNumBlocks * length;
      void *kernelArgs[] = {(void *)&start, (void *)&length, (void *)&num, (void *)&stride};
      dim3 dimGrid((num+1)/2, 1, 1);
      gpuErrchk(cudaLaunchCooperativeKernel((void *)StridedVectorReduceKernel, dimGrid, dimBlock, kernelArgs));
    
      if (chunks == 1)
        return;
      StridedVectorReduce(d_in, chunks, length, stride * 2 * MaxNumBlocks, device);
    }
    void SVR2(double * d_in, int num_vectors, int length, int stride, int device, int *bc)
    {
      int numBlocksPerSm = 0;
      cudaDeviceProp deviceProp;
      gpuErrchk(cudaGetDeviceProperties(&deviceProp, device));
      gpuErrchk(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, StridedVectorReduceKernel, MAX_BLOCK_SIZE, 0));
    
      int MaxNumBlocks = numBlocksPerSm * deviceProp.multiProcessorCount;
      int num_threads = std::min((length + 1) / 2, MAX_BLOCK_SIZE);
      dim3 dimBlock(num_threads, 1, 1);
      dim3 dimGrid(MaxNumBlocks, 1, 1);
      SVR2Kernel<<<dimGrid, dimBlock>>>(d_in, num_vectors, length, stride, bc);
    }
    
    int main(int argc, char **argv)
    {
        int device = 0;
        gpuErrchk(cudaSetDevice(device));
        int num_vectors = atoi(argv[1]);
        int length = atoi(argv[2]);
        double *r = (double *)malloc(sizeof(double)*length);
        double *h_in = (double *)malloc(sizeof(double) * length * num_vectors);
        for (int k = 0; k < num_vectors; k++)
        {
        for (int i = 0; i < length / 2; i++)
            h_in[k * length + i] = i+1;
        for (int i = length / 2; i < length; i++)
            h_in[k * length + i] = 0.0;
        }
    
        double *d_in;
        gpuErrchk(cudaMalloc((void **)&d_in, sizeof(double) * length * num_vectors));
        gpuErrchk(cudaMemcpy(d_in, h_in, sizeof(double) * length * num_vectors, cudaMemcpyHostToDevice));
        unsigned long long dt = dtime_usec(0);
        StridedVectorReduce(d_in, num_vectors, length / 2, 2, device);
        cudaDeviceSynchronize();
        dt = dtime_usec(dt);
        gpuErrchk(cudaMemcpy(r, d_in, sizeof(double) * length, cudaMemcpyDeviceToHost));
        int start = (length / 2 < 100) ? 0 : length / 2 - 10;
        for (int i = start; i < length / 2; i++)
          printf("h_out: %i %f \n", i, r[i]);
        // output should be [1, 2, 3, ... , length / 2] * num_vectors
        printf("duration 1: %f seconds\n", dt/(float)USECPSEC);
        gpuErrchk(cudaMemcpy(d_in, h_in, sizeof(double) * length * num_vectors, cudaMemcpyHostToDevice));
        int *bc;
        cudaMalloc(&bc, sizeof(int));
        cudaMemset(bc, 0, sizeof(int));
        dt = dtime_usec(0);
        SVR2(d_in, num_vectors, length, 2, device, bc);
        cudaDeviceSynchronize();
        dt = dtime_usec(dt);
        gpuErrchk(cudaMemcpy(r, d_in, sizeof(double) * length, cudaMemcpyDeviceToHost));
        start = (length / 2 < 100) ? 0 : length / 2 - 10;
        for (int i = start; i < length / 2; i++)
          printf("h_out: %i %f \n", i, r[i]);
        // output should be [1, 2, 3, ... , length / 2] * num_vectors
        printf("duration 2: %f seconds\n", dt/(float)USECPSEC);
        free(h_in);
        gpuErrchk(cudaFree(d_in));
    
        return 0;
    }
    $ nvcc -o t2255 t2255.cu
    $ ./t2255 10000 1000
    h_out: 490 4910000.000000
    h_out: 491 4920000.000000
    h_out: 492 4930000.000000
    h_out: 493 4940000.000000
    h_out: 494 4950000.000000
    h_out: 495 4960000.000000
    h_out: 496 4970000.000000
    h_out: 497 4980000.000000
    h_out: 498 4990000.000000
    h_out: 499 5000000.000000
    duration 1: 0.001469 seconds
    h_out: 490 4910000.000000
    h_out: 491 4920000.000000
    h_out: 492 4930000.000000
    h_out: 493 4940000.000000
    h_out: 494 4950000.000000
    h_out: 495 4960000.000000
    h_out: 496 4970000.000000
    h_out: 497 4980000.000000
    h_out: 498 4990000.000000
    h_out: 499 5000000.000000
    duration 2: 0.000549 seconds
    $
    

    You didn't provide definitions for MAX_BLOCK_SIZE or gpuErrchk, so your posted code doesn't actually compile; I provided trivial definitions for those.