Efficient way to perform radial average of matrix elements without using shared memory

I am trying to implement radial average of matrix element in CUDA, where I have to find and print average of all neighbouring elements (including itself) of every matrix element. Following is what I ended up with(for radius = 1):

__global__ void matrix_avg(float *ad,float *bd)
{
    float t1, t2, t3, t4, t5, t6, t7, t8, t9, avg;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int k = (i*N)+(j);

    if(i==0)
        {
            if(j==0)
            {
                bd[k]=(ad[k+1]+ad[k+N]+ad[k+N+1]+ad[k])/4;
            }
            else if(j==N-1)
            {
                bd[k]=(ad[k-1]+ad[k+N]+ad[k+N-1]+ad[k])/4;
            }
            else
            {
                bd[k]=(ad[k-1]+ad[k+1]+ad[k+N-1]+ad[k+N]+ad[k+N+1]+ad[k])/6;
            }
        }
        else if(i==N-1)
        {
            if(j==0)
            {
                bd[k]=(ad[k+1]+ad[k-N]+ad[k-N+1]+ad[k])/4;
            }
            else if(j==N-1)
            {
                bd[k]=(ad[k-1]+ad[k-N]+ad[k-N-1]+ad[k])/4;
            }
            else
            {
                bd[k]=(ad[k-1]+ad[k+1]+ad[k-N-1]+ad[k-N]+ad[k-N+1]+ad[k])/6;
            }
        }
        else if(j==0)
        {
            bd[k]=(ad[k-N]+ad[k-N+1]+ad[k+1]+ad[k+N]+ad[k+N+1]+ad[k])/6;
        }
        else if(j==N-1)
        {
            bd[k]=(ad[k-N-1]+ad[k-N]+ad[k-1]+ad[k+N-1]+ad[k+N]+ad[k])/6;
        }
        else
        {
            t1=ad[k-N-1];
            t2=ad[k-N];
            t3=ad[k-N+1];
            t4=ad[k-1];
            t5=ad[k+1];
            t6=ad[k+N-1];
            t7=ad[k+N];
            t8=ad[k+N+1];
            t9=ad[k];
            avg=(t1+t2+t3+t4+t5+t6+t7+t8+t9)/9;
            bd[k]=avg;
        }
}

My above code checks for the conditions of top row, bottom row, right most and left most column elements, for which it has to calculate average of 6 elements. Ans also for 4 corner elements for which it has to calculate average of 4 elements. For remaining inner elements it has to calculate average of 9 elements. Above code is just a simple conversion C to CUDA program. I am looking for most efficient way without using shared memory to write the program., for any given radius. Any algorithm, pseudo code or suggestion will do. Thanks in advance.

Solution

This is how I implemented radial average of matrix elements without using shared memory:

__global__ void matrix_avg(float *ad,float *bd, int radius, int N)
{
    int counter =0,i,j;
    float sum=0.0;

    int globalRow = blockIdx.y * blockDim.y + threadIdx.y;
    int globalCol = blockIdx.x * blockDim.x + threadIdx.x;

    for(i=-radius;i<=radius;i++)
    {
        for(j=-radius;j<=radius;j++)
        {
            if(((globalRow+i)<0) || ((globalCol+j)<0) || ((globalRow+i)>=N) || ((globalCol+j)>=N))
            {
                sum = sum + 0;
            }
            else
            {
                sum = sum + ad[(globalRow+i)*N+(globalCol+j)];
                counter++;
            }
        }
    }
    bd[globalRow*N+globalCol]=sum/counter;
}

ad - Input matrix.

bd - Output matrix

N - Matrix dimension. (I kept it as square matrix for now)

radius - radius to perform range for average calculation