
CUDA shared memory programming is not working


Hi all,

I am learning how shared memory accelerates GPU programs. I am using the code below to calculate, for each element, its squared value plus the square of the average of its left and right neighbors. The code runs, but the result is not what I expect.

The first 10 results printed out are 0,1,2,3,4,5,6,7,8,9, while I am expecting 25,2,8,18,32,50,72,98,128,162.

The code is as follows, written with reference to the example here:

Would you please tell me which part is wrong? Your help is very much appreciated.

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <cuda.h>

const int N=1024;

__global__ void compute_it(float *data)
{
    int tid = threadIdx.x;
    __shared__ float myblock[N];
    float tmp;

    // load the thread's data element into shared memory
    myblock[tid] = data[tid];

    // ensure that all threads have loaded their values into
    // shared memory; otherwise, one thread might be computing
    // on uninitialized data.
    __syncthreads();

    // compute the average of this thread's left and right neighbors
    tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<(N-1)?tid+1:0]) * 0.5f;
    // square the previous result and add my value, squared
    tmp = tmp*tmp + myblock[tid]*myblock[tid];

    // write the result back to global memory
    data[tid] = myblock[tid];
    __syncthreads();
}

int main()
{
    char key;

    float *a;
    float *dev_a;

    a = (float*)malloc(N*sizeof(float));
    cudaMalloc((void**)&dev_a, N*sizeof(float));

    for (int i = 0; i < N; i++){
        a[i] = i;
    }

    cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice);

    compute_it<<<N,1>>>(dev_a);

    cudaMemcpy(a, dev_a, N*sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 10; i++){
        std::cout << a[i] << ",";
    }

    std::cin >> key;

    free(a);
    free(dev_a);

    return 0;
}

Solution

  • One of the most immediate problems in your kernel code is this:

    data[tid] = myblock[tid];
    

    I think you probably meant this:

    data[tid] = tmp;
    

    In addition, you're launching 1024 blocks of one thread each. This isn't a particularly effective way to use the GPU, and it means that the tid variable in every threadblock is 0 (and only 0, since there is only one thread per threadblock).

    There are many problems with this approach, but one immediate problem will be encountered here:

    tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<(N-1)?tid+1:0]) * 0.5f;
    

    Since tid is always zero, and therefore no other values in your shared memory array (myblock) ever get populated, the logic in this line cannot produce a sensible result. When tid is zero, you are selecting myblock[N-1] for the first term in the assignment to tmp, but myblock[1023] never gets populated with anything.

    It seems that you don't understand various CUDA hierarchies:

    • a grid is all threads associated with a kernel launch
    • a grid is composed of threadblocks
    • each threadblock is a group of threads working together on a single SM
    • the shared memory resource is a per-SM resource, not a device-wide resource
    • __syncthreads() also operates on a per-threadblock basis (not device-wide)
    • threadIdx.x is a built-in variable that provides a unique thread ID among all threads within a threadblock, but not globally across the grid (see the sketch just below this list).
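
    For illustration, here is a minimal sketch of how a unique grid-wide index is usually formed by combining these built-in variables (the kernel name and the doubling operation are just placeholders of my own):

    __global__ void example_kernel(float *data, int n)
    {
        // unique index across the entire grid, not just within one threadblock
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid < n)
            data[gid] = 2.0f * data[gid];
    }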

    Instead, you should break your problem into reasonably sized threadblocks (i.e. more than one thread per block). Each threadblock will then be able to behave roughly as you have outlined. You will then need to special-case the behavior at the starting point and ending point (in your data) of each threadblock; see the sketch just below.
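
    Here's a sketch of that idea. The names compute_tiled and BLOCK are my own, I keep your wrap-around behavior at the ends of the whole array, and I write to a separate output array so that one block's result writes cannot race with another block's halo reads:

    const int BLOCK = 256;   // threads per block; the launch below must use this value

    __global__ void compute_tiled(const float *in, float *out, int n)
    {
        __shared__ float tile[BLOCK + 2];           // BLOCK interior cells + one halo cell per side
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;                  // shift by 1 to leave room for the left halo

        if (gid < n)
            tile[lid] = in[gid];

        // the edge threads of each block also fetch the halo cells,
        // wrapping around at the ends of the full array
        if (threadIdx.x == 0)
            tile[0] = in[gid > 0 ? gid - 1 : n - 1];
        if (threadIdx.x == blockDim.x - 1 || gid == n - 1)
            tile[lid + 1] = in[gid < n - 1 ? gid + 1 : 0];

        __syncthreads();

        if (gid < n) {
            float tmp = (tile[lid - 1] + tile[lid + 1]) * 0.5f;
            out[gid] = tmp * tmp + tile[lid] * tile[lid];
        }
    }

    // usage (dev_in and dev_out are assumed device allocations of N floats):
    int blocks = (N + BLOCK - 1) / BLOCK;           // enough blocks to cover all N elements
    compute_tiled<<<blocks, BLOCK>>>(dev_in, dev_out, N);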

    You're also not doing proper CUDA error checking, which is recommended, especially any time you're having trouble with a CUDA code.
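
    One common pattern (the macro name cudaCheck is my own) is to wrap every runtime API call, and to check kernel launches separately with cudaGetLastError():

    #define cudaCheck(call)                                               \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                        __FILE__, __LINE__, cudaGetErrorString(err));     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    // usage:
    cudaCheck(cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice));
    compute_it<<<1,N>>>(dev_a);
    cudaCheck(cudaGetLastError());         // catches launch-configuration errors
    cudaCheck(cudaDeviceSynchronize());    // catches errors raised during kernel execution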

    If you make the change I indicated first in your kernel code, and reverse the order of your block and grid kernel launch parameters:

    compute_it<<<1,N>>>(dev_a);
    

    As indicated by Kristof, you will get something that comes close to what you want, I think. However, you will not be able to conveniently scale that beyond N=1024 without other changes to your code.
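
    The 1024 figure comes from the maximum number of threads per block on current GPUs (and the per-block shared memory size also caps how large myblock can be). If you want to check the limits on your own device, something along these lines should work:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);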

    This line of code is also not correct:

    free (dev_a);
    

    Since dev_a was allocated on the device using cudaMalloc, you should free it like this:

    cudaFree (dev_a);