Search code examples
cudagpgpugpuimagegpujcuda

Efficient way to perform radial average of matrix elements without using shared memory


I am trying to implement radial average of matrix element in CUDA, where I have to find and print average of all neighbouring elements (including itself) of every matrix element. Following is what I ended up with(for radius = 1):

__global__ void matrix_avg(float *ad,float *bd)
{
    float t1, t2, t3, t4, t5, t6, t7, t8, t9, avg;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int k = (i*N)+(j);

    if(i==0)
        {
            if(j==0)
            {
                bd[k]=(ad[k+1]+ad[k+N]+ad[k+N+1]+ad[k])/4;
            }
            else if(j==N-1)
            {
                bd[k]=(ad[k-1]+ad[k+N]+ad[k+N-1]+ad[k])/4;
            }
            else
            {
                bd[k]=(ad[k-1]+ad[k+1]+ad[k+N-1]+ad[k+N]+ad[k+N+1]+ad[k])/6;
            }
        }
        else if(i==N-1)
        {
            if(j==0)
            {
                bd[k]=(ad[k+1]+ad[k-N]+ad[k-N+1]+ad[k])/4;
            }
            else if(j==N-1)
            {
                bd[k]=(ad[k-1]+ad[k-N]+ad[k-N-1]+ad[k])/4;
            }
            else
            {
                bd[k]=(ad[k-1]+ad[k+1]+ad[k-N-1]+ad[k-N]+ad[k-N+1]+ad[k])/6;
            }
        }
        else if(j==0)
        {
            bd[k]=(ad[k-N]+ad[k-N+1]+ad[k+1]+ad[k+N]+ad[k+N+1]+ad[k])/6;
        }
        else if(j==N-1)
        {
            bd[k]=(ad[k-N-1]+ad[k-N]+ad[k-1]+ad[k+N-1]+ad[k+N]+ad[k])/6;
        }
        else
        {
            t1=ad[k-N-1];
            t2=ad[k-N];
            t3=ad[k-N+1];
            t4=ad[k-1];
            t5=ad[k+1];
            t6=ad[k+N-1];
            t7=ad[k+N];
            t8=ad[k+N+1];
            t9=ad[k];
            avg=(t1+t2+t3+t4+t5+t6+t7+t8+t9)/9;
            bd[k]=avg;
        }
}

My above code checks for the conditions of top row, bottom row, right most and left most column elements, for which it has to calculate average of 6 elements. Ans also for 4 corner elements for which it has to calculate average of 4 elements. For remaining inner elements it has to calculate average of 9 elements. Above code is just a simple conversion C to CUDA program. I am looking for most efficient way without using shared memory to write the program., for any given radius. Any algorithm, pseudo code or suggestion will do. Thanks in advance.


Solution

  • This is how I implemented radial average of matrix elements without using shared memory:

    __global__ void matrix_avg(float *ad,float *bd, int radius, int N)
    {
        int counter =0,i,j;
        float sum=0.0;
    
        int globalRow = blockIdx.y * blockDim.y + threadIdx.y;
        int globalCol = blockIdx.x * blockDim.x + threadIdx.x;
    
        for(i=-radius;i<=radius;i++)
        {
            for(j=-radius;j<=radius;j++)
            {
                if(((globalRow+i)<0) || ((globalCol+j)<0) || ((globalRow+i)>=N) || ((globalCol+j)>=N))
                {
                    sum = sum + 0;
                }
                else
                {
                    sum = sum + ad[(globalRow+i)*N+(globalCol+j)];
                    counter++;
                }
            }
        }
        bd[globalRow*N+globalCol]=sum/counter;
    }
    

    ad - Input matrix.

    bd - Output matrix

    N - Matrix dimension. (I kept it as square matrix for now)

    radius - radius to perform range for average calculation