Is it necessary to call cudaDeviceSynchronize() after invoking a CUB routine from within a CUDA kernel (i.e., via dynamic parallelism)? I assumed that when, say, DeviceReduce::Sum() is called from the device, its implicit memory copies would block the device from moving on. However, I have been seeing instability with the following code called on the GPU:
__device__ void calcMonomerFlux(double* fluxes, double* lengths, double* dt)
{
    // numBlocks, numThreads, and maxlength are defined elsewhere
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;

    arrInitToLengths<<< numBlocks, numThreads >>>(lengths);
    cudaDeviceSynchronize();

    arrMult<<< numBlocks, numThreads >>>(fluxes, lengths, lengths);
    cudaDeviceSynchronize();

    double sum = 0;

    // First call: with d_temp_storage == NULL, CUB only writes the
    // required temporary storage size into temp_storage_bytes
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, lengths, lengths, maxlength);
    //cudaDeviceSynchronize();

    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    //cudaDeviceSynchronize();

    // Second call: performs the actual reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, lengths, lengths, maxlength);
    //cudaDeviceSynchronize();

    cudaFree(d_temp_storage);
}
Yes, cudaDeviceSynchronize() is required after each CUB call; note where the commented-out sync calls sit in the question. I spent many hours tracking down why my sums were computed incorrectly, and inconsistently from run to run. Eventually, while stepping through in the NSIGHT debugger, I found that the results were correct only when I set a breakpoint after each CUB function, which effectively forced each call to finish before the next one ran.
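For reference, here is a sketch of the questioner's function with that fix applied. The only change is that the commented-out cudaDeviceSynchronize() calls are restored, so each CUB call (and the allocation between them) completes before the next step reads its results. As in the question, numBlocks, numThreads, maxlength, and the helper kernels arrInitToLengths and arrMult are assumed to be defined elsewhere in the translation unit.

```cuda
#include <cub/cub.cuh>

__device__ void calcMonomerFlux(double* fluxes, double* lengths, double* dt)
{
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;

    arrInitToLengths<<< numBlocks, numThreads >>>(lengths);
    cudaDeviceSynchronize();

    arrMult<<< numBlocks, numThreads >>>(fluxes, lengths, lengths);
    cudaDeviceSynchronize();

    // Sizing pass: fills in temp_storage_bytes only
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, lengths, lengths, maxlength);
    cudaDeviceSynchronize();   // make sure temp_storage_bytes is written

    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cudaDeviceSynchronize();   // make sure the allocation is visible

    // Reduction pass: lengths[0] receives the sum
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, lengths, lengths, maxlength);
    cudaDeviceSynchronize();   // make sure the sum is complete before it is read

    cudaFree(d_temp_storage);
}
```

This mirrors what the breakpoints in the debugger were doing implicitly: each pause let the previously launched work drain before the next call consumed its output.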