
Does CUDA automatically convert float4 arrays into a struct of arrays?


I have the following snippet of code:

#include <stdio.h>

struct Nonsense {
    float3 group;
    float other;
};

__global__ void coalesced(float4* float4Array, Nonsense* nonsenseArray) {
    float4 someCoordinate = float4Array[threadIdx.x];
    someCoordinate.x = 5;
    float4Array[threadIdx.x] = someCoordinate;

    Nonsense nonsenseValue = nonsenseArray[threadIdx.x];
    nonsenseValue.other = 3;
    nonsenseArray[threadIdx.x] = nonsenseValue;
}

int main() {
    float4* float4Array;
    cudaMalloc(&float4Array, 32 * sizeof(float4));
    cudaMemset(float4Array, 0, 32 * sizeof(float4));

    Nonsense* nonsenseArray;
    cudaMalloc(&nonsenseArray, 32 * sizeof(Nonsense));
    cudaMemset(nonsenseArray, 0, 32 * sizeof(Nonsense));

    coalesced<<<1, 32>>>(float4Array, nonsenseArray);
    cudaDeviceSynchronize();
    return 0;
}

When I run this through the Nvidia profiler in Nsight and look at the Global Memory Access Pattern, the float4Array has perfectly coalesced reads and writes. Meanwhile, the Nonsense array has a poor access pattern (because it is an array of structs).

Does NVCC automatically convert a float4 array, which is conceptually an array of structs, into a struct of arrays for better memory access patterns?


Solution

  • No, it does not convert it into a struct of arrays. If you think about this carefully, you will conclude that it is nearly impossible for the compiler to reorganize the data this way. After all, the thing being passed to the kernel is a pointer.

    There is only one array, and the elements of that array still have the struct elements in the same order:

    float address (i.e. index):      0      1      2      3      4      5 ...
    array element             : a[0].x a[0].y a[0].z a[0].w a[1].x a[1].y ...
    

    However, the float4 array gives a better pattern because the compiler generates a single 16-byte load per thread. This is sometimes referred to as a "vector load" because we are loading a vector (float4 in this case) per thread. Adjacent threads therefore still read adjacent data, and you get ideal coalescing behavior. In the above example, thread 0 reads a[0].x, a[0].y, a[0].z, and a[0].w; thread 1 reads a[1].x, a[1].y, and so on. All of this takes place in a single request (i.e. one SASS instruction), though it may be split across multiple transactions. The splitting of a request into multiple transactions does not result in any loss of efficiency (in this case).
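
    As an aside, here is a hedged sketch (not part of the original question) showing that the same one-vector-load-per-thread pattern can be produced for a plain float array by casting it to float4, assuming the pointer is 16-byte aligned (cudaMalloc guarantees this) and the element count is a multiple of 4:

    __global__ void vectorLoad(float* data) {
        // Reinterpret the float array as float4 so each thread issues a
        // single 16-byte vector load instead of four 4-byte loads.
        float4 v = reinterpret_cast<float4*>(data)[threadIdx.x];
        v.x = 5;
        // The store back is likewise a single 16-byte vector store.
        reinterpret_cast<float4*>(data)[threadIdx.x] = v;
    }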

    In the case of the Nonsense struct, the compiler does not recognize that the struct could also be loaded in a similar fashion, so under the hood it must generate 3 or 4 loads per thread:

    • one 8-byte load (or two 4-byte loads) to load the first two words of the float3 group
    • one 4-byte load to load the last word of the float3 group
    • one 4-byte load to load the float other

    If you map out the above loads per thread, perhaps using the above diagram, you will see that each load involves a stride (unused elements between the items loaded per thread) and so results in lower efficiency.
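
    To make the stride concrete, here is a hedged sketch of the scalar loads the compiler effectively emits for the Nonsense read above (the exact split between 8-byte and 4-byte loads depends on the compiler and architecture; the kernel name is hypothetical):

    __global__ void effectiveLoads(Nonsense* nonsenseArray) {
        const float* base = reinterpret_cast<const float*>(nonsenseArray);
        Nonsense value;
        // group.x and group.y: an 8-byte load (or two 4-byte loads).
        // Adjacent threads read addresses 16 bytes apart, so only half
        // the bytes the hardware fetches are used.
        value.group.x = base[4 * threadIdx.x + 0];
        value.group.y = base[4 * threadIdx.x + 1];
        // group.z: a separate 4-byte load with the same 16-byte stride,
        // so only a quarter of the fetched bytes are used.
        value.group.z = base[4 * threadIdx.x + 2];
        // other: another strided 4-byte load.
        value.other = base[4 * threadIdx.x + 3];
        nonsenseArray[threadIdx.x] = value;  // the store pattern suffers identically
    }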

    By using careful typecasting or a union in your struct definition, you can get the compiler to load your Nonsense struct in a single 16-byte load.
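
    A minimal sketch of the union approach (punning through a union is tolerated by the CUDA toolchain in practice, though strict C++ calls it undefined behavior; the name NonsenseVec is hypothetical):

    union NonsenseVec {
        Nonsense n;
        float4 v;  // forces 16-byte alignment and permits a vector load/store
    };

    __global__ void coalescedNonsense(NonsenseVec* arr) {
        NonsenseVec value;
        // One 16-byte vector load per thread, just like the float4 case.
        value.v = arr[threadIdx.x].v;
        value.n.other = 3;
        // One 16-byte vector store back.
        arr[threadIdx.x].v = value.v;
    }

    Because float4 carries 16-byte alignment, the union inherits it, which is what allows the hardware to issue the 128-bit access.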

    This answer also covers some ideas related to AoS -> SoA conversion and the efficiency gains it can bring.

    This answer covers vector load details.