Normalize a bunch of vectors using Nvidia's Thrust library

I just learned about Nvidia's thrust library. Just to try it wrote a small example which is supposed to normalize a bunch of vectors.

#include <cstdio>

#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

struct normalize_functor: public thrust::unary_function<double4, double4>
{
    __device__ __host__ double4 operator()(double4 v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
        printf("%f %f %f\n", v.x, v.y, v.z);
    }
};

int main()
{
    thrust::host_vector<double4> v(2);
    v[0].x = 1; v[0].y = 2; v[0].z = 3;
    v[1].x = 4; v[1].y = 5; v[1].z = 6;

    thrust::device_vector<double4> v_d = v; 
    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());

    // This doesn't seem to copy back
    v = v_d;

    // Neither this does..
    thrust::host_vector<double4> result = v_d;

    for(int i=0; i<v.size(); i++)
        printf("[ %f %f %f ]\n", result[i].x, result[i].y, result[i].z);

    return 0;
}

The example above seems to work, however I'm unable to copy the data back.. I thought a simple assignment would invoke a cudaMemcpy. It works to copy the data from the host to the device but not back???

Secondly I'm not sure if I do this the right way. The documentation of for_each says:

for_each applies the function object f to each element in the range [first, last); f's return value, if any, is ignored.

But the unary_function struct template expects two template arguments (one for the return value) and forces the operator() to also return a value, this results in a warning when compiling. I don't see how I'm supposed to write an unary functor with no return value.

Next is the data arrangement. I just chose double4 since this will result in two fetch instructions ld.v2.f64 and ld.f64 IIRC. However I'm wondering how thrust fetches data under the hood (and how many cuda threads/blocks) are created. If I would chose instead a struct of 4 vectors would it be able to fetch data in a coalesced way.

Finally thrust provides tuples. What about an array of tuples? How would the data be arranged in this case.

I looked through the examples, but I haven't found an example which explains which data structure to choose for a bunch of vectors (the dot_products_with_zip.cu example says something about "structure of arrays" instead of "arrays of structures" but I see no structures used in the example.

Update

I fixed the code above and tried to run a larger example, this time normalizing 10k vectors.

#include <cstdio>

#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

struct normalize_functor
{
    __device__ __host__ void operator()(double4& v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
    }
};

int main()
{
    int n = 10000;
    thrust::host_vector<double4> v(n);
    for(int i=0; i<n; i++) {
        v[i].x = rand();
        v[i].y = rand();
        v[i].z = rand();
    }

    thrust::device_vector<double4> v_d = v;

    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());

    v = v_d;

    return 0;
}

Profiling with computeprof shows me a low occupancy and non-coalesced memory access:

Kernel Occupancy Analysis

Kernel details : Grid size: 23 x 1 x 1, Block size: 448 x 1 x 1
Register Ratio      = 0.984375  ( 32256 / 32768 ) [24 registers per thread] 
Shared Memory Ratio     = 0 ( 0 / 49152 ) [0 bytes per Block] 
Active Blocks per SM        = 3 / 8
Active threads per SM       = 1344 / 1536
Potential Occupancy     = 0.875  ( 42 / 48 )
Max achieved occupancy  = 0.583333  (on 9 SMs)
Min achieved occupancy  = 0.291667  (on 5 SMs)
Occupancy limiting factor   = Block-Size

Memory Throughput Analysis for kernel launch_closure_by_value on device GeForce GTX 470

Kernel requested global memory read throughput(GB/s): 29.21
Kernel requested global memory write throughput(GB/s): 17.52
Kernel requested global memory throughput(GB/s): 46.73
L1 cache read throughput(GB/s): 100.40
L1 cache global hit ratio (%): 48.15
Texture cache memory throughput(GB/s): 0.00
Texture cache hit rate(%): 0.00
L2 cache texture memory read throughput(GB/s): 0.00
L2 cache global memory read throughput(GB/s): 42.44
L2 cache global memory write throughput(GB/s): 46.73
L2 cache global memory throughput(GB/s): 89.17
L2 cache read hit ratio(%): 88.86
L2 cache write hit ratio(%): 3.09
Local memory bus traffic(%): 0.00
Global memory excess load(%): 31.18
Global memory excess store(%): 62.50
Achieved global memory read throughput(GB/s): 4.73
Achieved global memory write throughput(GB/s): 45.29
Achieved global memory throughput(GB/s): 50.01
Peak global memory throughput(GB/s): 133.92

I wonder how I can optimized this?

Solution

If you want to modify a sequence in-place with for_each then you'll need to take the argument by reference in the functor:

struct normalize_functor
{
    __device__ __host__ void operator()(double4& ref)
    {
        double v = ref;
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
        printf("%f %f %f\n", v.x, v.y, v.z);
        ref = v;
    }
};

Alternatively, you could use your definition of normalize_functor with the transform algorithm, specifying v_d as both the source and destination range:

thrust::transform(v_d.begin(), v_d.end(), v_d.begin(), normalize_functor());

My personal preference is to use transform in this situation, but the performance ought to be the same in either case.