I'm using CUDA 5.5 and I find one compiler behavior a bit odd. If I load a struct whose only data is four unsigned chars, the compiler emits four separate u8 loads. If I instead use a union and load a uchar4, it produces the desired nc.v4.u8 load.
This code produces ld.global.u8 %rs5, [%r32];
const int wu = 4;
struct data {
    uchar_t v[wu];
    CUDA_CALLABLE_MEMBER uchar_t &operator[] (int i) {
        return v[i];
    }
} fetch[rows];
for (int i = 0; i < rows; i++) {
    fetch[i] = *((data*)&src[offsetSrc + i*strideSrc]);
}
So I have to wrap the array in a union to get the desired ld.global.nc.v4.u8 {%rs49, %rs50, %rs51, %rs52}, [%r37];
const int wu = 4;
struct data {
    union {
        uchar_t v[wu];
        uchar4 v4;
    };
    CUDA_CALLABLE_MEMBER uchar_t &operator[] (int i) {
        return v[i];
    }
} fetch[rows];
for (int i = 0; i < rows; i++) {
    fetch[i].v4 = *((uchar4*)&src[offsetSrc + i*strideSrc]);
}
The GPU requires that all data is naturally aligned (i.e. 16-bit data is 16-bit aligned, 32-bit data is 32-bit aligned, 64-bit data is 64-bit aligned, etc). A uchar4 is a struct of four unsigned chars that is 32-bit aligned through the use of an alignment attribute. Therefore it can be loaded with a single 32-bit access. An array of four unsigned chars, on the other hand, is only guaranteed to have 8-bit alignment, so the compiler cannot assume it is safe to load it with a single 32-bit access and falls back to four byte loads. A union is aligned to the strictest alignment required by any of its members, which is why the union containing a uchar4 can be loaded with one access.
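The alignment rules described above can be checked at compile time. The following is a minimal sketch, assuming a C++11-capable nvcc (newer than the CUDA 5.5 toolkit in the question) and assuming uchar_t is a typedef for unsigned char, which the question does not show. The struct names PlainBytes and UnionBytes are hypothetical and used only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>  // declares the built-in vector type uchar4

typedef unsigned char uchar_t;  // assumption: the question does not show this typedef

// Plain array of four bytes: its alignment is that of unsigned char.
struct PlainBytes {
    uchar_t v[4];
};

// The same four bytes overlaid with a uchar4: the struct takes the
// strictest alignment among the union members.
struct UnionBytes {
    union {
        uchar_t v[4];
        uchar4  v4;
    };
};

static_assert(alignof(PlainBytes) == 1, "byte array is only byte aligned");
static_assert(alignof(uchar4) == 4, "uchar4 is declared with 4-byte alignment");
static_assert(alignof(UnionBytes) == 4, "union inherits the alignment of uchar4");

int main()
{
    // Both structs are 4 bytes in size; only the alignment guarantee differs.
    printf("PlainBytes: size %zu, alignment %zu\n",
           sizeof(PlainBytes), alignof(PlainBytes));
    printf("UnionBytes: size %zu, alignment %zu\n",
           sizeof(UnionBytes), alignof(UnionBytes));
    return 0;
}
```

Both types occupy four bytes, so the difference in generated loads comes from the alignment guarantee alone, not from the size of the data.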
User-defined data types can be aligned with the __align__ attribute, which is described in the CUDA Programming Guide.
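As a rough sketch of that approach, the struct from the question could be declared with __align__(4) instead of wrapping the array in a union. The kernel name copyRows, the struct name Data, and the uchar_t typedef are assumptions made for this example, and __host__ __device__ stands in for the question's CUDA_CALLABLE_MEMBER macro. Whether the compiler then emits a single 32-bit or vector load is best confirmed by inspecting the PTX (for example with nvcc -ptx). Note also that the attribute is only a promise to the compiler: if a byte pointer is cast to Data*, as in the question, the address itself must really be a multiple of 4.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef unsigned char uchar_t;  // assumption: not shown in the question

const int wu = 4;

// __align__(4) tells the compiler every Data object starts on a 4-byte boundary.
struct __align__(4) Data {
    uchar_t v[wu];
    __host__ __device__ uchar_t &operator[](int i) { return v[i]; }
};

static_assert(sizeof(Data) == 4, "Data should occupy exactly four bytes");
static_assert(alignof(Data) == 4, "Data should be 4-byte aligned");

// Hypothetical kernel: each thread copies one 4-byte element.
__global__ void copyRows(const Data *src, Data *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        Data d = src[i];  // aligned 4-byte object: candidate for a single load
        dst[i] = d;
    }
}

int main()
{
    const int n = 256;
    Data host[n];
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < wu; ++j)
            host[i][j] = static_cast<uchar_t>(i + j);

    Data *dSrc = 0, *dDst = 0;
    cudaMalloc(&dSrc, n * sizeof(Data));  // cudaMalloc returns suitably aligned memory
    cudaMalloc(&dDst, n * sizeof(Data));
    cudaMemcpy(dSrc, host, n * sizeof(Data), cudaMemcpyHostToDevice);

    copyRows<<<(n + 127) / 128, 128>>>(dSrc, dDst, n);

    Data out[n];
    cudaMemcpy(out, dDst, n * sizeof(Data), cudaMemcpyDeviceToHost);

    // Compare the round-tripped data with the input, byte by byte.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < wu; ++j)
            if (out[i][j] != host[i][j]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(dSrc);
    cudaFree(dDst);
    return mismatches != 0;
}
```

Compared with the union workaround, this keeps the operator[] interface unchanged and needs no separate v4 member, at the cost of requiring every Data address to honor the declared alignment.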