cuda, thrust

structure inside thrust::device_vector


I have a structure defined as:

struct Point
{
    int x;
    int y;
    float val;
};

I intend to use this struct to represent sparse matrices (I am aware of CUSPARSE and CUSP, but I just intend to perform some tests using Thrust) and to perform the operations using Thrust algorithms.

From what I have studied in CUDA programming tutorials, it is always advisable to use a struct of arrays rather than an array of structs for better memory coalescing.

If that is the case, and I store the non-zeroes (on the order of millions) inside a device_vector using the above-mentioned struct, will this device_vector suffer from unaligned memory access on the GPU when Thrust algorithms operate on it?

I ask because I might need to access this device_vector at irregular strides and perform operations on it by passing multiple function objects.
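For concreteness, this is roughly what I have in mind with the array-of-structs layout; it is only a minimal sketch, and the scale_val functor is just a placeholder for the kind of function objects I would pass:

#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Same Point struct as above.
struct Point
{
    int x;
    int y;
    float val;
};

// Placeholder functor: scales every stored non-zero value.
struct scale_val
{
    float s;
    scale_val(float s) : s(s) {}

    __host__ __device__
    Point operator()(Point p) const
    {
        p.val *= s;
        return p;
    }
};

int main()
{
    // In practice this would hold millions of non-zeroes.
    thrust::device_vector<Point> nz(1000000);
    thrust::transform(nz.begin(), nz.end(), nz.begin(), scale_val(2.0f));
    return 0;
}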

Will it be as efficient as a custom kernel that operates on a struct of arrays?

Thanks.


Solution

  • NVIDIA CUDA devices can access 4-, 8-, and 16-byte structures efficiently, assuming coalesced memory access patterns. To this end, the CUDA headers define structs such as int2, int4, float2, and float4 that you can use. They are defined with efficient alignment, so instead of your custom Point struct, I recommend using

    typedef int2 Point;
    

    When all memory accesses to arrays of these small structs are sequential across the threads in a warp (i.e. coalesced), and all of the data in each struct element is used by the thread that reads/writes it, then this type of AOS access is very efficient. In fact, using vector structs like this can often lead to higher memory throughput than scalar accesses, due to the increased number of memory transactions in flight. A small kernel sketch of this access pattern is included at the end of this answer.

    Thrust provides zip_iterator specifically for the convenience and (coding) efficiency of operating on SOA data as if it were AOS data. So while the small structs are efficient in straight CUDA C++, when using Thrust you may instead choose to use a separate device_vector for each struct member, and zip them together using a zip_iterator before calling transform and other thrust algorithms. There are examples of this included with the Thrust sample code.
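    As a rough illustration of the first point (the kernel and its names are hypothetical, not from any library), each thread reads and writes one whole int2 element, so a warp touches one contiguous, aligned range of memory:

    // Hypothetical kernel: coalesced AOS access using the built-in int2 type.
    __global__ void shift_points(const int2 *in, int2 *out, int n, int dx, int dy)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            int2 p = in[i];   // one aligned 8-byte load per thread
            p.x += dx;
            p.y += dy;
            out[i] = p;       // one aligned 8-byte store per thread
        }
    }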
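    And here is a minimal sketch of the SOA-plus-zip_iterator approach (the scale_val functor and the vector names are just for illustration): each struct member lives in its own device_vector, and the zip_iterator presents them to the algorithm as tuples:

    #include <thrust/device_vector.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/transform.h>
    #include <thrust/tuple.h>

    // Illustrative functor: operates on one (row, col, val) tuple at a time.
    struct scale_val
    {
        float s;
        scale_val(float s) : s(s) {}

        __host__ __device__
        thrust::tuple<int, int, float>
        operator()(const thrust::tuple<int, int, float> &t) const
        {
            return thrust::make_tuple(thrust::get<0>(t),
                                      thrust::get<1>(t),
                                      thrust::get<2>(t) * s);
        }
    };

    int main()
    {
        // SOA storage: one device_vector per struct member.
        thrust::device_vector<int>   rows(1000000);
        thrust::device_vector<int>   cols(1000000);
        thrust::device_vector<float> vals(1000000);

        // Zip the three sequences so the algorithm sees tuples (an AOS-like view).
        auto first = thrust::make_zip_iterator(
            thrust::make_tuple(rows.begin(), cols.begin(), vals.begin()));
        auto last  = thrust::make_zip_iterator(
            thrust::make_tuple(rows.end(), cols.end(), vals.end()));

        thrust::transform(first, last, first, scale_val(2.0f));
        return 0;
    }

    The zipped view keeps each member in its own contiguous array, so the underlying loads and stores remain plain coalesced accesses to the int and float arrays, even though the functor is written against tuples.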