Tags: c++, memory-management, sycl, dpc++

Is it more efficient in SYCL to use one buffer or multiple buffers?


Suppose I have an array of data, for example an array of 3D vectors of size N. Suppose that each iteration of my SYCL kernel is exclusively or primarily concerned with only one vector. Which of the following ways of breaking this into contiguous buffers is, as a general rule, most efficient--or does it matter?

I realize the targeted device affects this a lot, so let's assume it's a discrete GPU (i.e. the data really does have to be copied to a different memory chip, and the device doesn't have some crazy architecture like an FPGA). I'm mainly targeting a GTX 1080 via CUDA, but I expect the answer is likely similar when the code compiles to OpenCL or when targeting another modern GPU.

  1. Create a separate buffer for each coordinate, e.g. sycl::buffer<float> x, y, z;, each of size N. This way when accessing them I can use the sycl::id<1> passed to my kernel lambda as the index with no math. (I suspect the compiler may be able to optimize this.)
  2. Create one packed buffer for all of them, e.g. sycl::buffer<float> coords; with size 3N. When accessing them with a sycl::id<1> called i, I then grab the x coordinate as buffer_accessor[3*i], the y coordinate as buffer_accessor[3*i+1], and the z coordinate as buffer_accessor[3*i+2]. (I don't know if the compiler can optimize this, and I'm not sure if alignment issues might come into play.)
  3. Create one unpacked buffer using a struct, e.g. struct Coord { float x,y,z; }; sycl::buffer<Coord> coords;. This has the rather alarming cost of increasing memory usage, in this example by 33%, because of alignment padding--which will also increase the time required to copy the buffer to the device. But the tradeoff is that you can access the data without manipulating the sycl::id<1>, the runtime only has to deal with one buffer, and there shouldn't be any cache line alignment inefficiencies on the device.
  4. Use a two-dimensional buffer of size (N,3) and iterate only over the range of the first dimension. This is a less flexible solution and I don't see why I'd want to use multidimensional buffers when I'm not iterating over all the dimensions, unless there's a lot of optimization built in for this use case.

I cannot find any guidelines on data architecture to build an intuition for this sort of thing. Right now (4) seems silly, (3) involves unacceptable memory waste, and I'm using (2) but wondering whether I should be using (1) instead to avoid the id manipulation and the accesses in 3*sizeof(float)-sized chunks.
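
To make the options concrete, here is a rough sketch of how I would declare each layout (SYCL 2020 syntax; the sizes and names are just placeholders):

    #include <cstddef>
    #include <sycl/sycl.hpp>

    struct Coord { float x, y, z; };  // element type for option (3)

    void declare_layouts(std::size_t N) {
      // (1) one buffer per coordinate (structure of arrays)
      sycl::buffer<float> x{sycl::range<1>{N}};
      sycl::buffer<float> y{sycl::range<1>{N}};
      sycl::buffer<float> z{sycl::range<1>{N}};

      // (2) one packed buffer; vector i lives at [3*i], [3*i+1], [3*i+2]
      sycl::buffer<float> packed{sycl::range<1>{3 * N}};

      // (3) one buffer of structs (array of structs)
      sycl::buffer<Coord> aos{sycl::range<1>{N}};

      // (4) a two-dimensional buffer of shape (N, 3)
      sycl::buffer<float, 2> grid{sycl::range<2>{N, 3}};
    }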


Solution

  • For memory access patterns on GPUs, it is first important to understand the concept of coalescing. Basically it means that under certain conditions, the device will merge the memory accesses of adjacent work items and instead emit one large memory access. This is very important for performance. The detailed requirements for when coalescing occurs vary between GPU vendors (or even between GPU generations of one vendor), but usually the requirements tend to be along the lines of:

    • A certain number of adjacent work items access adjacent data elements, e.g. all work items in a SYCL sub-group / CUDA warp access consecutive data elements.
    • The data element accessed by the first work item might have to be aligned, e.g. to a cache line.

    See this explanation for (older) NVIDIA GPUs: https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/

    With that in mind, 3) not only wastes memory capacity but also memory bandwidth, and if you have something like my_accessor[id].x you get a strided memory access, which prevents coalescing.
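
    To make the stride concrete, here is a minimal sketch of what 3) might look like inside a kernel (SYCL 2020 syntax; the queue, buffer names and the computation are placeholders, not taken from your code):

    #include <cstddef>
    #include <sycl/sycl.hpp>

    struct Coord { float x, y, z; };

    // Option 3): array of structs. Adjacent work items read .x fields that are
    // sizeof(Coord) bytes apart, so a sub-group/warp touches a strided set of
    // addresses and the loads cannot be merged into one wide transaction.
    void read_x_aos(sycl::queue& q, sycl::buffer<Coord>& coords,
                    sycl::buffer<float>& out, std::size_t N) {
      q.submit([&](sycl::handler& cgh) {
        sycl::accessor in{coords, cgh, sycl::read_only};
        sycl::accessor res{out, cgh, sycl::write_only};
        cgh.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
          res[i] = in[i].x;  // strided load: stride = sizeof(Coord)
        });
      });
    }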

    For 4), I'm not sure if I understand correctly. I assume you mean that the dimension with 3 elements selects whether you access x/y/z and the one with N identifies the n-th vector. In that case it would depend on whether you have size (N, 3) or (3, N). Because in SYCL the data layout is such that the last index is always the fastest-varying one, (N, 3) would in practice correspond to 3) without the padding issue, while (3, N) would be similar to 2) but without the strided memory access (see below).
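
    For illustration, a rough sketch of the (3, N) variant, where the last (and therefore fastest-varying) index runs over the N vectors (again SYCL 2020 syntax with placeholder names):

    #include <cstddef>
    #include <sycl/sycl.hpp>

    // Option 4) with range (3, N): within each of the three rows, adjacent work
    // items read adjacent floats, which allows coalescing.
    void read_twod(sycl::queue& q, sycl::buffer<float, 2>& coords, // range {3, N}
                   sycl::buffer<float>& out, std::size_t N) {
      q.submit([&](sycl::handler& cgh) {
        sycl::accessor in{coords, cgh, sycl::read_only};
        sycl::accessor res{out, cgh, sycl::write_only};
        cgh.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> idx) {
          const std::size_t i = idx[0];
          // Row-major layout: element (row, i) sits at offset row * N + i.
          const float x = in[0][i];
          const float y = in[1][i];
          const float z = in[2][i];
          res[i] = x + y + z;  // placeholder computation
        });
      });
    }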

    For 2), the main performance issue is that you are performing a strided memory access if x is at [3*i], y at [3*i+1], etc. For coalescing you instead want x to be at [i], y at [N+i], and z at [2*N+i]. If you have something like

    float my_x = data[i];         // all N work items perform a coalesced access for x
    float my_y = data[i + N];     // the y values start N floats later, again contiguous
    float my_z = data[i + 2 * N]; // likewise for z, starting 2*N floats in
    

    you get a nice memory access pattern. Depending on your choice of N and the alignment requirements for coalesced memory accesses on your device, you might still see performance issues for y and z because of alignment.

    I don't expect the extra index arithmetic (adding the offsets) to substantially affect performance.

    For 1), you would mainly gain a guarantee that all data is nicely aligned and that accesses will coalesce. Because of this, I would expect it to perform best of the presented approaches.
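
    For completeness, a minimal sketch of what 1) could look like with one buffer per coordinate (SYCL 2020 syntax; the length computation is only a placeholder):

    #include <cstddef>
    #include <sycl/sycl.hpp>

    // Option 1): one buffer per coordinate. Each buffer starts at its own
    // allocation, so the runtime's alignment guarantees apply to x, y and z
    // independently, and every read below is a contiguous, coalesced access.
    void vector_lengths(sycl::queue& q,
                        sycl::buffer<float>& x, sycl::buffer<float>& y,
                        sycl::buffer<float>& z, sycl::buffer<float>& out,
                        std::size_t N) {
      q.submit([&](sycl::handler& cgh) {
        sycl::accessor ax{x, cgh, sycl::read_only};
        sycl::accessor ay{y, cgh, sycl::read_only};
        sycl::accessor az{z, cgh, sycl::read_only};
        sycl::accessor res{out, cgh, sycl::write_only};
        cgh.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
          res[i] = sycl::sqrt(ax[i] * ax[i] + ay[i] * ay[i] + az[i] * az[i]);
        });
      });
    }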

    From the perspective of the SYCL runtime, in general there are both advantages and disadvantages to using a single large buffer vs multiple smaller ones (e.g. overhead of many buffers, but more opportunities for task graph optimization strategies). I expect those effects to be secondary.