Tags: cuda, memory-alignment

How does __align__ work in CUDA C?


I am working through an exercise on Array of Structs (AoS). A struct is defined with or without __align__ like this:

#ifdef TESTALIGN8
struct __align__(8) InnerStruct {
    float x;
    float y;
};
#else
struct InnerStruct {
    float x;
    float y;
};
#endif

The test case is

__global__ void testGpuInnerStruct(InnerStruct *data, InnerStruct *result, const int n) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n) {
        result[idx].x = data[idx].x + 10;
        result[idx].y = data[idx].y + 20;
    }
}

The full file can be found in the gist.

Both cases were profiled with ncu-ui on a Quadro RTX 4000, and the Memory Workload Analysis sections look like this:

[screenshot: Memory Workload Analysis without __align__(8)]

[screenshot: Memory Workload Analysis with __align__(8)]

Why is the L1 hit rate of the latter case 0%? In my mind, the minimum load/store granularity is 32 bytes, and sizeof(InnerStruct) is 8 bytes with or without the __align__(8) qualifier, so InnerStruct.x and InnerStruct.y would always be read by the same load, with or without the L1 cache. How does __align__ impact performance like this?


Solution

  • Why is the L1 hit rate of the latter case 0%?

    The __align__(8) decoration allows the compiler to discover that it can convert two separate loads into a single load. As a result, in the non-decorated case the compiler generates two loads, and the second load derives a cache benefit (an L1 hit) from the first; in the decorated case there is only one load instruction, so there is no observed cache benefit.

    For the non-decorated case, the compiler does something like this:

    if (idx < n) {
      //result[idx].x = data[idx].x + 10;
      LDG R0, [data[idx].x];    // pulls two 128-byte L1 cachelines per warp: MISS L1, MISS L2
      FADD R1, R0, 10;
      STG [result[idx].x], R1;  // HIT L2
      //result[idx].y = data[idx].y + 20;
      LDG R0, [data[idx].y];    // benefits from the L1 cache: HIT L1
      FADD R1, R0, 20;
      STG [result[idx].y], R1;  // HIT L2
    }
    

    For the decorated case, the compiler does something like:

    if (idx < n) {
      //result[idx].x = data[idx].x + 10;
      //result[idx].y = data[idx].y + 20;
      LDG.64 R0, [data[idx].x];   // loads x into R0 and y into R1; nothing populated the cache prior to this: MISS L1, MISS L2
      FADD R0, R0, 10;
      FADD R1, R1, 20;
      STG.64 [result[idx].x], R0; // stores the R0:R1 pair: HIT L2
    }
    

    Thus there is only one load instruction, which does not get any cache benefit.

    In the non-decorated case, the compiler cannot assume that the struct is 8-byte aligned; it can only assume 4-byte alignment (the natural alignment for the float type). If the struct has only 4-byte alignment, then the LDG.64 instruction is not legal, because that instruction requires "natural" 8-byte alignment. Therefore, in the non-decorated case the compiler must use two 4-byte loads, whereas in the decorated case it knows that LDG.64 is legal, and so it uses it instead.

    (Aside: I suspect your GPU is not actually a Quadro 4000, but instead maybe a Quadro RTX 4000, because the Quadro 4000 was a fermi-class GPU which is not supported by any recent version of CUDA, much less nsight compute.)
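As a side note, if you cannot decorate the struct, you can request the vectorized access explicitly by loading through a type that already carries 8-byte alignment, such as the built-in float2. A hedged sketch (testGpuInnerStructVec is a hypothetical variant, not part of the gist; it relies on the allocation being 8-byte aligned, which cudaMalloc guarantees):

```cuda
// Hypothetical variant that forces an 8-byte vectorized access by
// reinterpreting the InnerStruct array as float2 (8-byte aligned type).
__global__ void testGpuInnerStructVec(InnerStruct *data, InnerStruct *result, const int n) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n) {
        float2 v = reinterpret_cast<float2 *>(data)[idx];  // one LDG.64
        v.x += 10;
        v.y += 20;
        reinterpret_cast<float2 *>(result)[idx] = v;       // one STG.64
    }
}
```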