Tags: cuda, reduction, gpu, shared-memory

CUDA - Parallel Reduction over one axis


I am fairly new to CUDA programming and I am trying to write a CUDA kernel for a parallel reduction over only one dimension of a 3-dimensional tensor, which is fed into the kernel as a row-major flattened float array.

In other words, I am trying to reimplement numpy.sum for the limited set of axis choices axis=0, axis=1, and axis=2.

I have successfully implemented "reduce over axis 0" and "reduce over axis 1", but performance issues with "reduce over axis 2" prompted me to post a question here and ask for advice.

The kernel is launched with a 1-D grid and 1-D block configuration, and it maps each thread to one element of the reduced output tensor. So it should be something like this: (figure: mapping of threads onto the elements of the reduced output tensor)
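
For reference, here is the plain CPU loop I am trying to reproduce for the axis=2 case (just to pin down the row-major indexing I am using; the function name is only illustrative):

// CPU reference for "reduce over axis 2" on a row-major (dim0 x dim1 x dim2) tensor
void reference_reduce_sum_axis2(const float* in, float* out,
                                int dim0, int dim1, int dim2)
{
    for (int d0 = 0; d0 < dim0; d0++)
        for (int d1 = 0; d1 < dim1; d1++) {
            float sum = 0.0f;
            for (int d2 = 0; d2 < dim2; d2++)
                sum += in[d0*dim1*dim2 + d1*dim2 + d2];  // contiguous along axis 2
            out[d0*dim1 + d1] = sum;                     // output is (dim0 x dim1)
        }
}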

Here is my kernel:

__global__ void kernel_reduce_sum_3d_try02(
        float* g_idata,
        float* g_odata,
        int dim0,
        int dim1,
        int dim2,
        int overaxis0,
        int overaxis1,
        int overaxis2)
{

    if (overaxis0 == 0 && overaxis1 == 0 && overaxis2 == 1) { 
        // static shared memory
        __shared__ float smem_store[BLOCK_SIZE];

        // set thread ID
        //unsigned int tid = threadIdx.x;
        unsigned int tid =  threadIdx.x;

        // global index
        unsigned int idx = (blockIdx.x * blockDim.x + threadIdx.x);

        // unrolling
        float tmpSum0 = 0;
        unsigned int i = 0;
        unsigned int src_index ;
        unsigned int _limit = (unsigned int)(dim0 * dim1 * dim2);

        //Indices over output 
        int thrd_d0 = (idx) / (dim1*1);
        int thrd_d1 = (idx - thrd_d0*dim1);

        //Only for debugging 
        printf("tid: %03d \tidx: %03d\td0: %02d\td1: %02d\n",tid,idx,thrd_d0,thrd_d1);


         
        for (i = 0; i < dim2; i++) {
            src_index = thrd_d0*dim1*dim2 + thrd_d1 * dim2 + i;
            if(idx<15) printf("idx: %d : src_index: %d\n",idx,src_index);
            if(src_index < _limit)
                tmpSum0 += g_idata[src_index];
        }


        if (src_index + 0 < _limit) smem_store[tid + 0] = tmpSum0;
        __syncthreads();

        unsigned int oindx = (unsigned int)( thrd_d0*dim1 + thrd_d1 );
        if (src_index + 0 <= _limit) g_odata[oindx + 0] = smem_store[tid + 0];
    }
}

Question 1. Is there a better way to assign threads and blocks to compute output tensor?

Question 2. In my kernel how can I increase occupancy? (which is around 50%)

Question 3. How should I use shared memory for global memory read operations? (In the case of a large dim2, each block would have to allocate a huge shared memory buffer, which is not good for concurrent warp execution or occupancy.)


Solution

  • The top two optimization objectives for any CUDA programmer are:

    1. expose enough parallelism (roughly: be able to use lots of threads)
    2. make efficient use of the memory subsystem

    For global memory, objective 2 means that we should strive for coalesced access when reading or writing to global memory. One example of coalesced access is when adjacent threads in a warp are reading adjacent locations in memory.
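
    As a minimal sketch of the pattern we are aiming for (illustrative only, not your kernel): in a coalesced read, thread k of a warp touches the element right next to the one thread k+1 touches.

    __global__ void coalesced_read(const float *in, float *out, size_t n)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];   // adjacent threads in a warp read adjacent addresses
    }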

    Does your kernel do this? It does not. Here's a trivial test case:

    $ cat t1.cu
    #include <stdio.h>
    #define BLOCK_SIZE 256
    
    __global__ void kernel_reduce_sum_3d_try02(
            float* g_idata,
            float* g_odata,
            int dim0,
            int dim1,
            int dim2,
            int overaxis0,
            int overaxis1,
            int overaxis2)
    {
    
        if (overaxis0 == 0 && overaxis1 == 0 && overaxis2 == 1) {
            // static shared memory
            __shared__ float smem_store[BLOCK_SIZE];
    
            // set thread ID
            //unsigned int tid = threadIdx.x;
            unsigned int tid =  threadIdx.x;
    
            // global index
            unsigned int idx = (blockIdx.x * blockDim.x + threadIdx.x);
    
            // unrolling
            float tmpSum0 = 0;
            unsigned int i = 0;
            unsigned int src_index ;
            unsigned int _limit = (unsigned int)(dim0 * dim1 * dim2);
    
            //Indices over output
            int thrd_d0 = (idx) / (dim1*1);
            int thrd_d1 = (idx - thrd_d0*dim1);
    
            //Only for debugging
            //printf("tid: %03d \tidx: %03d\td0: %02d\td1: %02d\n",tid,idx,thrd_d0,thrd_d1);
    
    
    
            for (i = 0; i < dim2; i++) {
                src_index = thrd_d0*dim1*dim2 + thrd_d1 * dim2 + i;
                //if(idx<15) printf("idx: %d : src_index: %d\n",idx,src_index);
                if(src_index < _limit){
                    tmpSum0 += g_idata[src_index];
                    if ((blockIdx.x == 0) && (i == 0) && (threadIdx.x < 32)) printf("thread: %d, src_index: %u\n", threadIdx.x, src_index);
            }
    
    
            if (src_index + 0 < _limit) smem_store[tid + 0] = tmpSum0;
            __syncthreads();
    
            unsigned int oindx = (unsigned int)( thrd_d0*dim1 + thrd_d1 );
            if (src_index + 0 <= _limit) g_odata[oindx + 0] = smem_store[tid + 0];
        }
      }
    }
    
    
    int main(){
    
    
    
      size_t dim = 256;
      size_t sz = dim*dim*dim*sizeof(float);
      float *g_idata, *g_odata;
      cudaMalloc(&g_idata, sz);
      cudaMalloc(&g_odata, sz);
      kernel_reduce_sum_3d_try02<<<dim/BLOCK_SIZE, BLOCK_SIZE>>>(
            g_idata,
            g_odata,
            dim,
            dim,
            dim,
            0,
            0,
            1);
      cudaDeviceSynchronize();
    }
    $ nvcc -o t1 t1.cu
    $ cuda-memcheck ./t1
    ========= CUDA-MEMCHECK
    thread: 0, src_index: 0
    thread: 1, src_index: 256
    thread: 2, src_index: 512
    thread: 3, src_index: 768
    thread: 4, src_index: 1024
    thread: 5, src_index: 1280
    thread: 6, src_index: 1536
    thread: 7, src_index: 1792
    thread: 8, src_index: 2048
    thread: 9, src_index: 2304
    thread: 10, src_index: 2560
    thread: 11, src_index: 2816
    thread: 12, src_index: 3072
    thread: 13, src_index: 3328
    thread: 14, src_index: 3584
    thread: 15, src_index: 3840
    thread: 16, src_index: 4096
    thread: 17, src_index: 4352
    thread: 18, src_index: 4608
    thread: 19, src_index: 4864
    thread: 20, src_index: 5120
    thread: 21, src_index: 5376
    thread: 22, src_index: 5632
    thread: 23, src_index: 5888
    thread: 24, src_index: 6144
    thread: 25, src_index: 6400
    thread: 26, src_index: 6656
    thread: 27, src_index: 6912
    thread: 28, src_index: 7168
    thread: 29, src_index: 7424
    thread: 30, src_index: 7680
    thread: 31, src_index: 7936
    ========= ERROR SUMMARY: 0 errors
    $
    

    We see that on one particular access, the threads in a warp are reading from locations that are separated by an index distance of 256 (ideally we would like this "distance" to be 1 as we go from thread to thread in a warp, for "coalesced" access to global memory). This follows directly from your indexing: consecutive values of idx (with the same thrd_d0) differ by 1 in thrd_d1, and thrd_d1 is multiplied by dim2 = 256 when forming src_index.

    So is this possible? It is possible for each of the 3 sum directions, although there will need to be somewhat different methodologies/realizations applied in each case. We'll come back to this.

    We also want to expose enough parallelism, which can roughly translate into "be able to launch kernels with ~10,000 or more threads". The behavior here will vary by GPU architecture and other factors, but this is a reasonable starting point for general understanding. A kernel with 256 threads total (for example) would not be enough to saturate any current CUDA GPU.

    Given that this sum function will work on 3-dimensional data, let's start by considering a 3-dimensional array of ~256 elements per dimension. If we chose to create just 256 threads and have each work its way through the array, that would be too few threads for objective 1. So let's think about realizations that can use 256x256 (or more) threads.

    Case 1: summing in the X-direction

    I will assume C-style storage, so the X direction is the direction of linear storage in memory: as we move from element to element along a row, the memory index increases by 1. Summing along these "rows" therefore requires that adjacent threads read adjacent elements belonging to the same sum, and then cooperate to combine them, so we will use a classical parallel reduction method. We choose a number of blocks equal to the Y-dimension of the array (256 in our example) times the Z-dimension (also 256), with 256 threads per block; each block cooperates to compute one row sum. That gives 65536 blocks and 65536*256 total threads, meeting our first objective. This is the only case that will use or need shared memory.
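
    In sketch form, the indexing for Case 1 looks like this (illustrative names and a fixed power-of-two block size of 256; the complete, tested kernel further below folds all cases together):

    // Case 1 sketch: one block per row sum; block-stride loop, then a shared-memory tree reduction
    __global__ void sum_rows_x(const float * __restrict__ in, float * __restrict__ out, size_t dimX)
    {
        __shared__ float sm[256];                        // assumes blockDim.x == 256
        size_t row_base = (size_t)blockIdx.x * dimX;     // blockIdx.x selects the (y,z) row
        float val = 0;
        for (size_t i = threadIdx.x; i < dimX; i += blockDim.x)
            val += in[row_base + i];                     // adjacent threads read adjacent elements
        sm[threadIdx.x] = val;
        __syncthreads();
        for (int s = blockDim.x >> 1; s > 0; s >>= 1) {  // classical parallel reduction in shared memory
            if (threadIdx.x < s) sm[threadIdx.x] += sm[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[blockIdx.x] = sm[0];   // one row sum per block
    }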

    Case 2: summing in the Y-direction

    We could refer to this as "summing columns in the Y direction". To satisfy our second objective, we require that each thread in a warp read an adjacent element, but adjacent elements in memory now belong to separate Y-column sums. An efficient realization in this case, provided we can keep enough threads busy, is to compute a separate sum in each thread: each thread traverses down a Y-column, keeping a running sum in a loop. A single Z-sheet (a slice in the X-Y plane) would only require 256 threads total in our example case, but we can increase the number of threads by working on all 256 sheets at the same time. So we can employ 256x256 = 65536 threads, meeting our first objective.
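
    A sketch of that idea (illustrative names; the full kernel below is the real example):

    // Case 2 sketch: one thread per (x,z) pair, running sum down its Y-column
    __global__ void sum_cols_y(const float * __restrict__ in, float * __restrict__ out,
                               size_t dimX, size_t dimY, size_t dimZ)
    {
        size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. dimX*dimZ-1
        if (idx < dimX * dimZ) {
            size_t x = idx % dimX;
            size_t z = idx / dimX;
            size_t t = x + z * dimX * dimY;   // start of this thread's Y-column
            float sum = 0;
            for (size_t y = 0; y < dimY; y++) {
                sum += in[t];                 // adjacent threads read adjacent addresses
                t += dimX;                    // step to the next element of the Y-column
            }
            out[idx] = sum;
        }
    }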

    Case 3: summing in the Z-direction (your example case in your question)

    This case is just a small variation on Case 2. Once again, to satisfy objective 2, we want adjacent threads in a warp to read adjacent locations in memory, and once again those locations belong to separate sums. Now, however, the threads traverse columns in the Z-direction rather than the Y-direction, so the indexing is slightly different, but overall the realization looks quite similar to Case 2. We again employ a grid of 256x256 threads, each keeping a separate running sum; each running sum starts in the first "sheet" (an X-Y plane), then iterates through the next sheet and the next in the Z-direction, summing "Z-columns", one per thread.
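
    The corresponding sketch for your case (again with illustrative names; the full kernel below):

    // Case 3 sketch: one thread per (x,y) pair, running sum through the Z "sheets"
    __global__ void sum_cols_z(const float * __restrict__ in, float * __restrict__ out,
                               size_t dimX, size_t dimY, size_t dimZ)
    {
        size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. dimX*dimY-1
        if (idx < dimX * dimY) {
            size_t t = idx;                   // element (x,y) in the first sheet
            float sum = 0;
            for (size_t z = 0; z < dimZ; z++) {
                sum += in[t];                 // adjacent threads read adjacent addresses
                t += dimX * dimY;             // hop one whole X-Y sheet per iteration
            }
            out[idx] = sum;                   // coalesced writes as well
        }
    }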

    Here's an example demonstrating several cases:

    $ cat t2.cu
    #include <stdio.h>
    #include <assert.h>
    
    // modifiable
    typedef float mytype;  // consider possibility of overflow e.g. for char types
    const size_t ddimX = 32768;
    const size_t ddimY = 100;
    const size_t ddimZ = 100;
    
    //don't modify
    const int nTPB=256;
    
    template <typename T>
    __inline__ __device__
    T warpReduceSum(T val) {
      for (int offset = warpSize/2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xFFFFFFFF, val, offset); // requires CUDA 9+
      return val;
    }
    
    
    template <typename T>
    __inline__ __device__
    T blockReduceSum(T val) {
    
      static __shared__ T shared[32]; // Shared mem for 32 partial sums
      int lane = threadIdx.x % warpSize;
      int wid = threadIdx.x / warpSize;
      val = warpReduceSum(val);     // Each warp performs partial reduction
      if (lane==0) shared[wid]=val; // Write reduced value to shared memory
      __syncthreads();              // Wait for all partial reductions
    
                          //read from shared memory only if that warp existed
      val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
    
      if (wid==0) val = warpReduceSum(val); //Final reduce within first warp
      return val;
    }
    
    template <typename T>
    __global__ void kernel_reduce_sum_3d(
            const T * __restrict__  g_idata,
            T * __restrict__  g_odata,
            const size_t dim0,
            const size_t dim1,
            const size_t dim2,
            const bool overaxis0,
            const bool overaxis1,
            const bool overaxis2)
    {
      if (overaxis0){
        size_t bidx = blockIdx.x;
        size_t tidx = threadIdx.x;
        size_t limit;
        size_t base;
        T res = 0;
        if (!overaxis1 && !overaxis2){
        // Case 1 - sums in X-direction
        // each threadblock is responsible for a separate row sum
          limit = dim0;
          base = bidx*dim0;
          while (tidx < limit){
            res += g_idata[base+tidx];
            tidx += blockDim.x;}} // block-stride loop
        else if (overaxis1 && !overaxis2){
        // Case 4 - sums in X-Y plane
        // each threadblock will be responsible for an X-Y plane
          limit = dim0*dim1;
          base = bidx*dim0*dim1;
          while (tidx < limit){
            res += g_idata[base+tidx];
            tidx += blockDim.x;}} // block-stride loop
        else if (!overaxis1 && overaxis2){
        // Case 5 - sums in X-Z plane
        // each threadblock will be responsible for an X-Z plane
          for (int i = 0; i < dim2; i++){
          tidx = threadIdx.x;
          limit = dim0;
          base = (bidx*dim0)+(i*dim0*dim1);
          while (tidx < limit){
            res += g_idata[base+tidx];
            tidx += blockDim.x;}}} // block-stride loop
        else assert(0); // not implemented! - the remaining case here is all 3 axes selected
    #ifndef USE_WS
        __shared__ T sm[nTPB];
        sm[threadIdx.x] = res;
        __syncthreads();
        // parallel reduction
        for (int i = blockDim.x>>1; i > warpSize; i>>=1){
          if (threadIdx.x < i) sm[threadIdx.x] += sm[threadIdx.x + i];
          __syncthreads();}
        for (int i = (blockDim.x == warpSize)?warpSize>>1:warpSize; i > 0; i>>=1){
          if (threadIdx.x < i) sm[threadIdx.x] += sm[threadIdx.x + i];
          if (threadIdx.x < warpSize) __syncwarp();}
        if (!threadIdx.x) g_odata[bidx] = sm[0];
    #else
        res = blockReduceSum(res);
        if (!threadIdx.x) g_odata[bidx] = res;
    #endif
        }
      else if (!overaxis0 && overaxis1 && !overaxis2){
        // Case 2 - sums in Y-direction
        // each thread is responsible for a separate Y-column sum
        size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
        if (idx < (dim0*dim2)){
          size_t tidx = idx%dim0 + (idx/dim0)*(dim0*dim1);
          T tsum = 0;
          for (size_t i = 0; i < dim1; i++){
            tsum += g_idata[tidx];
            tidx += dim0;}
          g_odata[idx] = tsum;
          }
        }
      else if (!overaxis0 && overaxis1 && overaxis2){
        // Case 6 - sums in Y-Z plane
        // each thread is responsible for a separate Y-Z plane sum (probably not optimal)
        size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
        if (idx < (dim0)){
          size_t tidx = idx;
          T tsum = 0;
          for (size_t i = 0; i < dim1*dim2; i++){
            tsum += g_idata[tidx];
            tidx += dim0;}
          g_odata[idx] = tsum;
          }
        }
      else if (!overaxis0 && !overaxis1 && overaxis2){
        // Case 3 - sums in Z-direction
        // each thread is responsible for a separate Z-column sum
        size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
        if (idx < (dim0*dim1)){
          size_t tidx = idx;
          T tsum = 0;
          for (size_t i = 0; i < dim2; i++){
            tsum += g_idata[tidx];
            tidx += dim0*dim1;}
          g_odata[idx] = tsum;
          }
        }
      else assert(0); // not implemented! - the remaining case here is no axes selected
    }
    
    template <typename T>
    bool validate(T *data, size_t dim, T val){
      for (size_t i = 0; i < dim; i++)
        if (data[i] != val) {printf("mismatch at %lu, was: %f, should be: %f\n", i, (float)(data[i]), (float)val); return false;}
      return true;
    }
    
    // round the requested size up to a power-of-two block size, clamped to [32, nTPB]
    size_t choose_block_size(size_t val){
      if (val >= nTPB) return nTPB;
      if (val <= 32) return 32;
      val = (val >> 1) | val;
      val = (val >> 2) | val;
      val = (val >> 4) | val;
      val = (val >> 8) | val;
      val = (val >> 16) | val;
      val++;
      return val;
    }
    
    int main(int argc, char *argv[]){
      size_t dimX = ddimX;
      size_t dimY = ddimY;
      size_t dimZ = ddimZ;
      if (argc > 1) {
        size_t nx = atoi(argv[1]);
        dimX = nx;
        dimY = nx-1;
        dimZ = nx-2;}
      // setup input array of all 1
      const size_t sz = dimX*dimY*dimZ*sizeof(mytype);
      mytype *d_in, *d_out, *h_in, *h_out;
      size_t rsz;
      h_in = new mytype[dimX*dimY*dimZ];
      assert(h_in != NULL);
      cudaError_t err = cudaMalloc(&d_in, sz);
      if (err != cudaSuccess) {printf("cudaMalloc1 error: %s\n", cudaGetErrorString(err)); return -1;}
      for (size_t i = 0; i < dimX*dimY*dimZ; i++) h_in[i] = 1;
      err = cudaMemcpy(d_in, h_in, sz, cudaMemcpyHostToDevice);
      if (err != cudaSuccess) {printf("cudaMemcpy1 error: %s\n", cudaGetErrorString(err)); return -1;}
      // Test X-direction sums
      printf("testing X-direction sums\n");
      rsz = dimY*dimZ*sizeof(mytype);
      h_out=new mytype[dimY*dimZ];
      err = cudaMalloc(&d_out, rsz);
      if (err != cudaSuccess) {printf("cudaMalloc2 error: %s\n", cudaGetErrorString(err)); return -1;}
      err = cudaMemset(d_out, 0, rsz);
      if (err != cudaSuccess) {printf("cudaMemset1 error: %s\n", cudaGetErrorString(err)); return -1;}
      kernel_reduce_sum_3d<<<dimY*dimZ, choose_block_size(dimX)>>>(d_in, d_out, dimX, dimY, dimZ, true, false, false);
      err = cudaMemcpy(h_out, d_out, rsz, cudaMemcpyDeviceToHost);
      if (err != cudaSuccess) {printf("result1 error: %s\n", cudaGetErrorString(err)); return -1;}
      if (!validate(h_out, dimY*dimZ, (mytype)dimX))  return -1;
      cudaFree(d_out);
      delete[] h_out;
      // Test Y-direction sums
      printf("testing Y-direction sums\n");
      rsz = dimX*dimZ*sizeof(mytype);
      h_out=new mytype[dimX*dimZ];
      err = cudaMalloc(&d_out, rsz);
      if (err != cudaSuccess) {printf("cudaMalloc3 error: %s\n", cudaGetErrorString(err)); return -1;}
      err = cudaMemset(d_out, 0, rsz);
      if (err != cudaSuccess) {printf("cudaMemset2 error: %s\n", cudaGetErrorString(err)); return -1;}
      kernel_reduce_sum_3d<<<((dimX*dimZ)+(nTPB-1))/nTPB, nTPB>>>(d_in, d_out, dimX, dimY, dimZ, false, true, false);
      err = cudaMemcpy(h_out, d_out, rsz, cudaMemcpyDeviceToHost);
      if (err != cudaSuccess) {printf("result2 error: %s\n", cudaGetErrorString(err)); return -1;}
      if (!validate(h_out, dimX*dimZ, (mytype)dimY))  return -1;
      cudaFree(d_out);
      delete[] h_out;
      // Test Z-direction sums
      printf("testing Z-direction sums\n");
      rsz = dimX*dimY*sizeof(mytype);
      h_out=new mytype[dimX*dimY];
      err = cudaMalloc(&d_out, rsz);
      if (err != cudaSuccess) {printf("cudaMalloc4 error: %s\n", cudaGetErrorString(err)); return -1;}
      err = cudaMemset(d_out, 0, rsz);
      if (err != cudaSuccess) {printf("cudaMemset3 error: %s\n", cudaGetErrorString(err)); return -1;}
      kernel_reduce_sum_3d<<<((dimX*dimY)+(nTPB-1))/nTPB, nTPB>>>(d_in, d_out, dimX, dimY, dimZ, false, false, true);
      err = cudaMemcpy(h_out, d_out, rsz, cudaMemcpyDeviceToHost);
      if (err != cudaSuccess) {printf("result3 error: %s\n", cudaGetErrorString(err)); return -1;}
      if (!validate(h_out, dimX*dimY, (mytype)dimZ))  return -1;
      cudaFree(d_out);
      delete[] h_out;
      // Test X-Y plane sums
      printf("testing X-Y plane sums\n");
      rsz = dimZ*sizeof(mytype);
      h_out=new mytype[dimZ];
      err = cudaMalloc(&d_out, rsz);
      if (err != cudaSuccess) {printf("cudaMalloc5 error: %s\n", cudaGetErrorString(err)); return -1;}
      err = cudaMemset(d_out, 0, rsz);
      if (err != cudaSuccess) {printf("cudaMemset4 error: %s\n", cudaGetErrorString(err)); return -1;}
      kernel_reduce_sum_3d<<<dimZ, nTPB>>>(d_in, d_out, dimX, dimY, dimZ, true, true, false);
      err = cudaMemcpy(h_out, d_out, rsz, cudaMemcpyDeviceToHost);
      if (err != cudaSuccess) {printf("result4 error: %s\n", cudaGetErrorString(err)); return -1;}
      if (!validate(h_out, dimZ, (mytype)(dimX*dimY)))  return -1;
      cudaFree(d_out);
      delete[] h_out;
      // Test X-Z plane sums
      printf("testing X-Z plane sums\n");
      rsz = dimY*sizeof(mytype);
      h_out=new mytype[dimY];
      err = cudaMalloc(&d_out, rsz);
      if (err != cudaSuccess) {printf("cudaMalloc6 error: %s\n", cudaGetErrorString(err)); return -1;}
      err = cudaMemset(d_out, 0, rsz);
      if (err != cudaSuccess) {printf("cudaMemset5 error: %s\n", cudaGetErrorString(err)); return -1;}
      kernel_reduce_sum_3d<<<dimY, choose_block_size(dimX)>>>(d_in, d_out, dimX, dimY, dimZ, true, false, true);
      err = cudaMemcpy(h_out, d_out, rsz, cudaMemcpyDeviceToHost);
      if (err != cudaSuccess) {printf("result5 error: %s\n", cudaGetErrorString(err)); return -1;}
      if (!validate(h_out, dimY, (mytype)(dimX*dimZ)))  return -1;
      cudaFree(d_out);
      delete[] h_out;
      // Test Y-Z plane sums
      printf("testing Y-Z plane sums\n");
      rsz = dimX*sizeof(mytype);
      h_out=new mytype[dimX];
      err = cudaMalloc(&d_out, rsz);
      if (err != cudaSuccess) {printf("cudaMalloc7 error: %s\n", cudaGetErrorString(err)); return -1;}
      err = cudaMemset(d_out, 0, rsz);
      if (err != cudaSuccess) {printf("cudaMemset6 error: %s\n", cudaGetErrorString(err)); return -1;}
      size_t tpb = choose_block_size(dimX);
      kernel_reduce_sum_3d<<<(dimX+tpb-1)/tpb, tpb>>>(d_in, d_out, dimX, dimY, dimZ, false, true, true);
      err = cudaMemcpy(h_out, d_out, rsz, cudaMemcpyDeviceToHost);
      if (err != cudaSuccess) {printf("result6 error: %s\n", cudaGetErrorString(err)); return -1;}
      if (!validate(h_out, dimX, (mytype)(dimY*dimZ)))  return -1;
      cudaFree(d_out);
      delete[] h_out;
      cudaFree(d_in);
      err=cudaGetLastError();
      if (err != cudaSuccess) {printf("CUDA error: %s\n", cudaGetErrorString(err)); return -1;}
      printf("success!\n");
      delete[] h_in;
      return 0;
    }
    $ nvcc -o t2 t2.cu
    $ cuda-memcheck ./t2
    ========= CUDA-MEMCHECK
    testing X-direction sums
    testing Y-direction sums
    testing Z-direction sums
    testing X-Y plane sums
    testing X-Z plane sums
    testing Y-Z plane sums
    success!
    ========= ERROR SUMMARY: 0 errors
    $ nvprof --print-gpu-trace ./t2
    ==11084== NVPROF is profiling process 11084, command: ./t2
    testing X-direction sums
    testing Y-direction sums
    testing Z-direction sums
    testing X-Y plane sums
    testing X-Z plane sums
    testing Y-Z plane sums
    success!
    ==11084== Profiling application: ./t2
    ==11084== Profiling result:
       Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
    2.32676s  478.32ms                    -               -         -         -         -  1.2207GB  2.5521GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
    2.80537s  5.2480us                    -               -         -         -         -  39.063KB  7.0985GB/s      Device           -  Quadro K2000 (0         1         7  [CUDA memset]
    2.80596s  39.427ms          (10000 1 1)       (256 1 1)        32  1.0000KB        0B         -           -           -           -  Quadro K2000 (0         1         7  void kernel_reduce_sum_3d<float>(float const *, float*, unsigned long, unsigned long, unsigned long, bool, bool, bool) [109]
    2.84539s  7.2320us                    -               -         -         -         -  39.063KB  5.1511GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]
    2.84586s  1.2160us                    -               -         -         -         -  12.500MB   1e+04GB/s      Device           -  Quadro K2000 (0         1         7  [CUDA memset]
    2.84619s  22.838ms          (12800 1 1)       (256 1 1)        32  1.0000KB        0B         -           -           -           -  Quadro K2000 (0         1         7  void kernel_reduce_sum_3d<float>(float const *, float*, unsigned long, unsigned long, unsigned long, bool, bool, bool) [114]
    2.86904s  5.8913ms                    -               -         -         -         -  12.500MB  2.0721GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]
    2.88707s  1.1840us                    -               -         -         -         -  12.500MB   1e+04GB/s      Device           -  Quadro K2000 (0         1         7  [CUDA memset]
    2.88740s  23.046ms          (12800 1 1)       (256 1 1)        32  1.0000KB        0B         -           -           -           -  Quadro K2000 (0         1         7  void kernel_reduce_sum_3d<float>(float const *, float*, unsigned long, unsigned long, unsigned long, bool, bool, bool) [119]
    2.91046s  5.5715ms                    -               -         -         -         -  12.500MB  2.1910GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]
    2.92758s  2.6240us                    -               -         -         -         -      400B  145.38MB/s      Device           -  Quadro K2000 (0         1         7  [CUDA memset]
    2.92762s  40.990ms            (100 1 1)       (256 1 1)        32  1.0000KB        0B         -           -           -           -  Quadro K2000 (0         1         7  void kernel_reduce_sum_3d<float>(float const *, float*, unsigned long, unsigned long, unsigned long, bool, bool, bool) [124]
    2.96861s  1.5360us                    -               -         -         -         -      400B  248.35MB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]
    2.96898s  2.6240us                    -               -         -         -         -      400B  145.38MB/s      Device           -  Quadro K2000 (0         1         7  [CUDA memset]
    2.96900s  43.392ms            (100 1 1)       (256 1 1)        32  1.0000KB        0B         -           -           -           -  Quadro K2000 (0         1         7  void kernel_reduce_sum_3d<float>(float const *, float*, unsigned long, unsigned long, unsigned long, bool, bool, bool) [129]
    3.01239s  1.5680us                    -               -         -         -         -      400B  243.28MB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]
    3.01277s  1.2160us                    -               -         -         -         -  128.00KB  100.39GB/s      Device           -  Quadro K2000 (0         1         7  [CUDA memset]
    3.01279s  23.059ms            (128 1 1)       (256 1 1)        32  1.0000KB        0B         -           -           -           -  Quadro K2000 (0         1         7  void kernel_reduce_sum_3d<float>(float const *, float*, unsigned long, unsigned long, unsigned long, bool, bool, bool) [134]
    3.03585s  20.928us                    -               -         -         -         -  128.00KB  5.8329GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]
    
    Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
    SSMem: Static shared memory allocated per CUDA block.
    DSMem: Dynamic shared memory allocated per CUDA block.
    SrcMemType: The type of source memory accessed by memory operation/copy
    DstMemType: The type of destination memory accessed by memory operation/copy
    

    Addressing your questions, briefly:

    Question 1. Is there a better way to assign threads and blocks to compute output tensor?

    As indicated, your choice of indexing is non-optimal, because it does not result in coalesced access to global memory. The alternate realization I provided will result in coalesced reads from global memory.

    Question 2. In my kernel how can I increase occupancy? (which is around 50%)

    The figure of merit for this memory bound kernel is not occupancy, but whether or not the kernel runtime is approximately equal to the time it takes to read and write the data. The above kernel should satisfy this. If you satisfy this condition for a memory bound kernel, there is no further improvement possible, regardless of occupancy.
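
    As a rough worked example (numbers approximate; measure your own achievable bandwidth, e.g. with the bandwidthTest CUDA sample): the X-direction kernel in the profile above reads 32768*100*100 floats, about 1.3 GB. Dividing that by the GPU's measured memory bandwidth gives a lower bound on the kernel time; if the profiled duration (about 39 ms here) is close to that bound, the kernel is effectively optimal regardless of what the occupancy figure says.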

    Question 3. How should I use shared memory for global memory read operations? (In the case of a large 'dim2', each block would have to allocate a huge shared memory buffer, which is not good for concurrent warp execution or occupancy.)

    Optimal realizations such as the one I have shown do not require shared memory at all for Case 2 and Case 3 (your case), and for Case 1 we can craft a kernel that requires only a small amount of shared memory per block; not enough to be an occupancy concern or to cause any other problem.