Is it necessary to call cudaDeviceSynchronize() after invoking a CUB routine from within a CUDA kernel (i.e., via dynamic parallelism)? I assumed that when, say, DeviceReduce::Sum() is called from the device, its implicit memory copies would block the device from moving on. However, I have been seeing instability with the following code called on the GPU:
__device__ void calcMonomerFlux(double* fluxes, double* lengths, double* dt)
{
    // numBlocks, numThreads, and maxlength are defined elsewhere
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;

    arrInitToLengths<<< numBlocks, numThreads >>>(lengths);
    cudaDeviceSynchronize();

    arrMult<<< numBlocks, numThreads >>>(fluxes, lengths, lengths);
    cudaDeviceSynchronize();

    double sum = 0;

    // First call: with d_temp_storage == NULL, CUB only writes the
    // required temporary storage size into temp_storage_bytes
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, lengths, lengths, maxlength);
    //cudaDeviceSynchronize();

    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    //cudaDeviceSynchronize();

    // Second call: performs the actual reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, lengths, lengths, maxlength);
    //cudaDeviceSynchronize();

    cudaFree(d_temp_storage);
}
Yes, cudaDeviceSynchronize() is required after each CUB call; note where the commented-out sync calls sit in the question. I spent many hours tracking down why my sums were computed incorrectly, and inconsistently from run to run. Eventually, while stepping through in the NSIGHT debugger, I found that the results were correct only when I set a breakpoint after each CUB function, which effectively forced each call to finish before the next one ran.
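For reference, here is a sketch of the questioner's function with that fix applied. The only change is that the commented-out cudaDeviceSynchronize() calls are restored, so each CUB call (and the allocation between them) completes before the next step reads its results. As in the question, numBlocks, numThreads, maxlength, and the helper kernels arrInitToLengths and arrMult are assumed to be defined elsewhere in the translation unit.

```cuda
#include <cub/cub.cuh>

__device__ void calcMonomerFlux(double* fluxes, double* lengths, double* dt)
{
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;

    arrInitToLengths<<< numBlocks, numThreads >>>(lengths);
    cudaDeviceSynchronize();

    arrMult<<< numBlocks, numThreads >>>(fluxes, lengths, lengths);
    cudaDeviceSynchronize();

    // Sizing pass: fills in temp_storage_bytes only
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, lengths, lengths, maxlength);
    cudaDeviceSynchronize();   // make sure temp_storage_bytes is written

    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cudaDeviceSynchronize();   // make sure the allocation is visible

    // Reduction pass: lengths[0] receives the sum
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, lengths, lengths, maxlength);
    cudaDeviceSynchronize();   // make sure the sum is complete before it is read

    cudaFree(d_temp_storage);
}
```

This mirrors what the breakpoints in the debugger were doing implicitly: each pause let the previously launched work drain before the next call consumed its output.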