Following this question, with reference to the shared memory example in the official guide, I'm trying to build the heat equation matrix, which looks just like in this poorly drawn image I made:
Here's what I've done so far (minimal example):
#define N 32
#define BLOCK_SIZE 16
#define NUM_BLOCKS ((N + BLOCK_SIZE - 1) / BLOCK_SIZE)

__global__ void heat_matrix(int* A)
{
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ int temp_sm_A[N*N];
    int* temp_A = &temp_sm_A[0]; memset(temp_A, 0, N*N*sizeof(int));
    if (tid < N) //(*)
    {
        #pragma unroll
        for (unsigned int m = 0; m < NUM_BLOCKS; ++m)
        {
            #pragma unroll
            for (unsigned int e = 0; e < BLOCK_SIZE; ++e)
            {
                if ( (tid == 0 && e == 0) || (tid == (N-1) && e == (BLOCK_SIZE-1)) )
                {
                    temp_A[tid + (e + BLOCK_SIZE * m) * N] = -2;
                    temp_A[tid + (e + BLOCK_SIZE * m) * N + ( tid==0 ? 1 : -1 )] = 1;
                }
                if ( tid == e )
                {
                    temp_A[tid + (e + BLOCK_SIZE * m) * N - 1] = 1;
                    //printf("temp_A[%d] = 1;\n", (tid + (e + BLOCK_SIZE * m) * N - 1));
                    temp_A[tid + (e + BLOCK_SIZE * m) * N] = -2;
                    //printf("temp_A[%d] = -2;\n", (tid + (e + BLOCK_SIZE * m) * N));
                    temp_A[tid + (e + BLOCK_SIZE * m) * N + 1] = 1;
                    //printf("temp_A[%d] = 1;\n", (tid + (e + BLOCK_SIZE * m) * N + 1));
                }
            }
        }
        __syncthreads(); //(**)
        memcpy(A, temp_A, N*N*sizeof(int));
    }
}

int main(){
    int* h_A = (int*)malloc(N*N*sizeof(int)); memset(h_A, 0, N*N*sizeof(int));
    int* d_A;
    checkCudaErrors(cudaMalloc((void**)&d_A, N*N*sizeof(int)));
    checkCudaErrors(cudaMemcpy(d_A, h_A, N*N*sizeof(int), cudaMemcpyHostToDevice));

    dim3 dim_grid((N/2 + BLOCK_SIZE - 1) / BLOCK_SIZE);
    dim3 dim_block(BLOCK_SIZE);

    heat_matrix <<< dim_grid, dim_block >>> (d_A);
    checkCudaErrors(cudaMemcpy(h_A, d_A, N*N*sizeof(int), cudaMemcpyDeviceToHost));
    ...
}
The code is arranged to suit a large N (larger than 32), taking advantage of the block division. Compiling with nvcc and running the executable yields
CUDA error at matrix.cu:102 code=77(cudaErrorIllegalAddress) "cudaMemcpy(h_A, d_A, N*N*sizeof(int), cudaMemcpyDeviceToHost)"
And cuda-memcheck reports only one error (actually there is another, but it comes from checkCudaErrors(cudaDeviceReset()); ... later in the code):
========= CUDA-MEMCHECK
========= Invalid __shared__ write of size 4
========= at 0x00000cd0 in heat_matrix(int*)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0xfffffffc is out of bounds
...
I can't see where I went wrong in the code. How can thread 0 in the first block cause an illegal access? There is even a specific if case to deal with it, and the report doesn't state the line of code where the error occurred.
Moreover, is there a more efficient way to structure my code than dealing with all those ifs? Surely there is, but I couldn't find a better parallel expression to split the cases inside the second for loop.
On a side note, the (*) seems unnecessary to me; on the other hand, (**) is necessary if I want to follow up with other GPU function calls. Am I right?
Check out this line:
temp_A[tid + (e + BLOCK_SIZE * m) * N - 1] = 1;
For the thread with tid equal to zero during the first iteration, tid + (e + BLOCK_SIZE * m) * N - 1 evaluates to an index of -1. This is exactly what the cuda-memcheck output is complaining about (with the address having wrapped around due to underflow).
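Spelling that out with the values from the question (N = 32, BLOCK_SIZE = 16): for tid = 0, e = 0, m = 0 the index is 0 + (0 + 16*0)*32 - 1 = -1. All operands are unsigned, so the index wraps around, and multiplied by sizeof(int) = 4 it shows up as the reported address 0xfffffffc.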
A similar out-of-bounds access will occur later for the line
temp_A[tid + (e + BLOCK_SIZE * m) * N + 1] = 1;
when tid, e, and m all assume their maximum values.
You have multiple threads writing to the same memory location. Each thread should write to exactly one array element per inner loop iteration. There is no need to write out neighboring elements because they are already covered by their own threads.
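A minimal sketch of what that could look like, keeping the question's temp_A layout and loop structure (the names j and idx are mine), replacing the whole body of the inner loop:

unsigned int j   = e + BLOCK_SIZE * m;  // second index handled in this iteration
unsigned int idx = tid + j * N;         // the single element this thread owns here
if (tid == j)
    temp_A[idx] = -2;                   // main diagonal
else if (tid == j - 1 || tid == j + 1)
    temp_A[idx] = 1;                    // the two neighbouring diagonals
// every other element keeps the 0 written during initialization

Assuming the launch covers all N values of tid, this also handles the first and last row, so the separate boundary if is no longer needed.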
You have a race condition between the initializing memset() and the stores inside the main loops. Put a __syncthreads() after the memset().
The calls to memset() and memcpy() will lead to each thread doing a full set/copy, performing the operations N times instead of just once. The common way of handling this is to write out the operation explicitly, dividing the work between the threads of the block.
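For the shared-memory initialization that could look roughly like this (a sketch, not code from the question; a second loop of the same shape would handle the copy back out):

for (unsigned int i = threadIdx.x; i < N * N; i += blockDim.x)
    temp_sm_A[i] = 0;   // each thread clears a strided subset of the array
__syncthreads();        // make the zeros visible to all threads before the main loops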
However ... there is no benefit from creating the matrix in shared memory first and then copying it to global memory later. Writing directly to A in global memory eliminates the need for memset(), memcpy() and __syncthreads() altogether.
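A minimal sketch of that direct approach (my own naming; it assumes the same row-major N x N layout of A as the question's indexing):

__global__ void heat_matrix(int* A)
{
    const unsigned int row = threadIdx.x + blockIdx.x * blockDim.x;
    if (row >= N)
        return;

    // Each thread owns one row and writes every element of it exactly once,
    // straight to global memory: -2 on the diagonal, 1 next to it, 0 elsewhere.
    for (unsigned int col = 0; col < N; ++col)
    {
        int value = 0;
        if (col == row)
            value = -2;
        else if (col + 1 == row || col == row + 1)
            value = 1;
        A[row * N + col] = value;
    }
}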
Using a block size of just 16 threads leaves half of the resources unused, as thread blocks are allocated in units of 32 threads (a warp).
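With the sketch above, the launch could then be, for example (again only a sketch, reusing the question's names):

#define BLOCK_SIZE 32   // one full warp per block

dim3 dim_grid((N + BLOCK_SIZE - 1) / BLOCK_SIZE);   // enough blocks to cover all N rows
dim3 dim_block(BLOCK_SIZE);
heat_matrix<<<dim_grid, dim_block>>>(d_A);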
You may want to re-read the section about the Thread Hierarchy in the CUDA C Programming Guide.