OpenMP to CUDA: Reduction

I'm trying to figure out how to express the equivalent of OpenMP's reduction() clause in CUDA. I've done some research online, but nothing I've tried has worked. The code:

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
    {
        float f = ...  //store return from function to f
        out[i] = f;    //store f to out[i]
        sum += f;      //add f to sum and store in sum
    }

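I know what reduction() does in OpenMP: it makes the last line of the loop body safe to run in parallel, because each thread accumulates into its own private copy of sum and the copies are combined when the loop ends. But how can I express the same thing in CUDA?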

Thanks!

Solution

Use Thrust, an STL-inspired library that ships with the CUDA Toolkit. Its thrust::reduce and thrust::transform_reduce algorithms cover this pattern. See the Quick Start Guide for examples of how to perform reductions.
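As a rough sketch of how the loop from the question might map onto Thrust: thrust::transform fills out, and thrust::reduce then sums it. The functor compute_f and the helper transform_and_sum are hypothetical names, and the body of compute_f is only a placeholder because the question elides the real function. This assumes the per-element value depends only on the index i.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <cstdio>

// Hypothetical stand-in for the function elided in the question.
// Replace the body with the real per-element computation.
struct compute_f
{
    __host__ __device__
    float operator()(int i) const
    {
        return 0.5f * i;  // placeholder value, an assumption for this sketch
    }
};

// Fills out[i] = compute_f(i) on the device, then returns the sum of out.
float transform_and_sum(thrust::device_vector<float>& out)
{
    const int n = static_cast<int>(out.size());

    // Equivalent of "out[i] = f;" for every i in [0, n)
    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(n),
                      out.begin(),
                      compute_f());

    // Equivalent of "reduction(+:sum)": parallel sum starting from 0
    return thrust::reduce(out.begin(), out.end(), 0.0f, thrust::plus<float>());
}

int main()
{
    const int N = 1000;
    thrust::device_vector<float> out(N);  // lives in GPU memory

    float sum = transform_and_sum(out);
    std::printf("sum = %f\n", sum);
    return 0;
}
```

This should build with nvcc (for example, nvcc reduce.cu -o reduce). If out is not needed afterwards, thrust::transform_reduce can do both steps in a single pass without storing the intermediate values. Note that a parallel float sum may differ slightly from a serial one because the order of additions is not fixed, which is also true of the OpenMP version.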