Why does a statically declared array of 4 unsigned chars produce ld.global.u8 when fetching memory?

I'm using CUDA 5.5 and I find one compiler behavior a bit odd: if I load a struct whose only data is 4 unsigned chars, the compiler emits four separate u8 loads. If I instead use a union and load a uchar4, it produces the desired nc.v4.u8 load.

This code produces ld.global.u8 %rs5, [%r32];

        const int wu = 4;
        struct data {
            uchar_t v[wu];           
            CUDA_CALLABLE_MEMBER uchar_t &operator[] (int i) {
                return v[i];
            }
        } fetch[rows];

        for (int i = 0; i < rows; i++) {
            fetch[i] = *((data*)&src[offsetSrc + i*strideSrc]);
        }

So I have to work around it with a union to produce the desired ld.global.nc.v4.u8 {%rs49, %rs50, %rs51, %rs52}, [%r37];

        const int wu = 4;
        struct data {
            union {
                uchar_t v[wu];
                uchar4 v4;
            };
            CUDA_CALLABLE_MEMBER uchar_t &operator[] (int i) {
                return v[i];
            }
        } fetch[rows];

        for (int i = 0; i < rows; i++) {
            fetch[i].v4 = *((uchar4*)&src[offsetSrc + i*strideSrc]);
        }

Solution

The GPU requires that all data is naturally aligned (i.e. 16-bit data is 16-bit aligned, 32-bit data is 32-bit aligned, 64-bit data is 64-bit aligned, etc). A uchar4 is a struct of four unsigned chars that is 32-bit aligned through the use of an alignment attribute, so it can be loaded with a single 32-bit access. An array of four unsigned chars, on the other hand, is only guaranteed to be 8-bit aligned, so the compiler cannot load it with a single 32-bit access and has to emit four separate u8 loads. A union is aligned to the strictest alignment required by any of its members, so a union that contains a uchar4 is itself 32-bit aligned, which is why the second version produces the vector load.
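The alignment rules above can be checked on the host with a small sketch. This is only an illustration under assumptions: `uchar4_like` is a hypothetical stand-in for CUDA's `uchar4` (so the snippet compiles without CUDA headers), `plain_data` and `union_data` are made-up names mirroring the two structs in the question, and a C++11 compiler is assumed (newer than the CUDA 5.5 toolchain).

```cpp
// Hypothetical stand-in for CUDA's uchar4: four bytes declared 4-byte aligned.
struct alignas(4) uchar4_like { unsigned char x, y, z, w; };

// Mirrors the first struct in the question: only an array of unsigned chars,
// so its alignment is that of unsigned char.
struct plain_data {
    unsigned char v[4];
};

// Mirrors the second struct: the anonymous union takes the strictest
// alignment among its members, here the 4-byte alignment of uchar4_like.
struct union_data {
    union {
        unsigned char v[4];
        uchar4_like v4;
    };
};

// Both types hold the same four bytes...
static_assert(sizeof(plain_data) == 4, "four bytes of payload");
static_assert(sizeof(union_data) == 4, "four bytes of payload");

// ...but only the union version promises 4-byte alignment to the compiler.
static_assert(alignof(plain_data) == 1, "array of unsigned char is byte aligned");
static_assert(alignof(union_data) == 4, "union inherits the strictest member alignment");
```

The two types have the same size, so the difference in the generated loads comes only from what the type lets the compiler assume about the address.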

User-defined data types can be given a stricter alignment with the __align__ specifier, which is described in the CUDA Programming Guide.
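As a rough sketch of that approach, the struct from the question could be declared with 4-byte alignment directly, without the union. Several things here are assumptions: the `CUDA_CALLABLE_MEMBER` and `uchar_t` definitions are guessed (the question does not show them), `ALIGN4` and `offset_is_aligned` are hypothetical helper names, and the compile-time checks need C++11, so they would have to be dropped on a CUDA 5.5 toolchain.

```cpp
#include <cstddef>

#if defined(__CUDACC__)
#define CUDA_CALLABLE_MEMBER __host__ __device__  // assumed definition
#define ALIGN4 __align__(4)                       // CUDA alignment specifier
#else
#define CUDA_CALLABLE_MEMBER
#define ALIGN4 alignas(4)                         // C++11 equivalent for host-only builds
#endif

typedef unsigned char uchar_t;  // assumed; not defined in the question
const int wu = 4;

// Same layout as the first struct in the question, but declared 4-byte aligned.
struct ALIGN4 data {
    uchar_t v[wu];
    CUDA_CALLABLE_MEMBER uchar_t &operator[](int i) { return v[i]; }
};

static_assert(sizeof(data) == 4, "still four bytes of payload");
static_assert(alignof(data) == 4, "now promised to be 4-byte aligned");

// The promise cuts both ways: every address cast to data* must really be
// 4-byte aligned. For a base pointer that is itself aligned, that means the
// byte offset has to be a multiple of the struct's alignment.
constexpr bool offset_is_aligned(std::size_t byteOffset) {
    return byteOffset % alignof(data) == 0;
}
```

With a declaration like this, the original `fetch[i] = *((data*)&src[offsetSrc + i*strideSrc]);` form should no longer be limited to byte loads, provided `offsetSrc + i*strideSrc` is always a multiple of 4 and `src` is 4-byte aligned. Which exact instruction the compiler then picks is worth confirming in the generated PTX.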