here is my code trying to do reduction to find maximum of a 50 value array in a block. I have padded the array to 64.
For threads 1-31 I have correct maxVal printing out but for threads 32-49 it's a completely random number. I dont know what I am doing wrong.
btw. I thought I dont need to _sync every line in unrolling but apparently I have to. any suggestion about that?
Thanks in advance for any help.
//block size = 50
__syncthreads();
if (tid<32){
cptmp[tid]=(cptmp[tid]< cptmp[tid+32]) ? cptmp[tid+32] : cptmp[tid];__syncthreads();
cptmp[tid]=(cptmp[tid]< cptmp[tid+16]) ? cptmp[tid+16] : cptmp[tid];__syncthreads();
cptmp[tid]=(cptmp[tid]< cptmp[tid+8]) ? cptmp[tid+8] : cptmp[tid]; __syncthreads();
cptmp[tid]=(cptmp[tid]< cptmp[tid+4]) ? cptmp[tid+4] : cptmp[tid]; __syncthreads();
cptmp[tid]=(cptmp[tid]< cptmp[tid+2]) ? cptmp[tid+2] : cptmp[tid]; __syncthreads();
cptmp[tid]=(cptmp[tid]< cptmp[tid+1]) ? cptmp[tid+1] : cptmp[tid]; __syncthreads();
}
__syncthreads();
//if (tid==0) {
maxVal=cptmp[0];
if(bix==0 && biy==0) cuPrintf(" max:%f x:%d y:%d\n", maxVal, blockIdx.x, blockIdx.y);
//}
Here is a more efficient (at least on Fermi GPUs) and correct code using volatile. Replace T with your type (or use a template):
if (tid<32) {
volatile T *c = cptmp;
T t = c[tid];
c[tid] = t = (t < c[tid+32]) ? c[tid+32] : t;
c[tid] = t = (t < c[tid+16]) ? c[tid+16] : t;
c[tid] = t = (t < c[tid+ 8]) ? c[tid+ 8] : t;
c[tid] = t = (t < c[tid+ 4]) ? c[tid+ 4] : t;
c[tid] = t = (t < c[tid+ 2]) ? c[tid+ 2] : t;
c[tid] = t = (t < c[tid+ 1]) ? c[tid+ 1] : t;
}
Why is this more efficient? Well, for correctness in the absence of __syncthreads()
we must use a volatile pointer to shared memory. But that forces the compiler to "honor" all reads from and writes to shared memory -- it can't optimize and keep anything in registers. So by explicitly always keeping c[tid]
in the temporary t
, we save one shared memory load per line of code. And since Fermi is a load/store architecture which can only use registers as instruction operands, that means we save an instruction per line, or 6 instructions total (about 25% overall, I expect).
On the old T10/GT200 architecture and earlier, your code (with volatile and no __syncthreads()) would be equally efficient because that architecture could source one operand per instruction directly from shared memory.
This code should be equivalent if you prefer if
over ?:
:
if (tid<32) {
volatile T *c = cptmp;
T t = c[tid];
if (t < c[tid+32]) c[tid] = t = c[tid+32];
if (t < c[tid+16]) c[tid] = t = c[tid+16];
if (t < c[tid+ 8]) c[tid] = t = c[tid+ 8];
if (t < c[tid+ 4]) c[tid] = t = c[tid+ 4];
if (t < c[tid+ 2]) c[tid] = t = c[tid+ 2];
if (t < c[tid+ 1]) c[tid] = t = c[tid+ 1];
}