Tags: cuda, complex-numbers, thrust

Optimizing memory access for complex numbers


I have a kernel that operates on complex numbers, and I am loading the values like this:

thrust::complex<float> x = X[tIdx];

where X is in global memory. When I profile this kernel with nvvp, I find that it is memory bandwidth-limited and the profiler suggests that I improve the memory access pattern:

Global Load L2 Transactions/Access=8, Ideal Transactions/Access=4

The disassembly confirms that this line is indeed split into two 32-bit loads, producing a strided access pattern:

LDG.E R9, [R16];
LDG.E R11, [R16+0x4];

How can I get this to compile into a single 64-bit load?

Potential solutions

I realize this is pretty closely related to this earlier question but the proposed solutions (change the global memory layout or use shared memory) seem less ideal than a 64-bit load.

The NVIDIA developer blog suggests using reinterpret_cast to a vector data type such as float2, but I'm a little hazy about how this fits in with pointer aliasing rules.
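
For reference, the pattern that blog post describes would look something like the sketch below. This is only an illustration of the idea (it doesn't settle the aliasing question); it assumes X points to 8-byte-aligned storage, which cudaMalloc guarantees, and that thrust::complex<float> is laid out as two consecutive floats:

float2 tmp = reinterpret_cast<const float2*>(X)[tIdx]; // one 64-bit load instead of two 32-bit loads
thrust::complex<float> x(tmp.x, tmp.y);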

I must also confess that this is somewhat of a theoretical question. For this particular kernel, I'm limited by the device memory bandwidth, so halving the # of L2 transactions shouldn't significantly improve the overall performance. But I anticipate working with more complex numbers in my future, and if there's a simple solution then I'd like to start using it now.


Solution

  • The basic problem here is that the compiler seems to need explicit alignment specifications for a type before it will generate vector load and store instructions. Consider the following trivial example:

    class __align__(8) cplx0
    {
        public:
        __device__ __host__ cplx0(float _re, float _img) : re(_re), img(_img) {};
    
        float re, img;
    };
    
    class cplx1
    {
        public:
        __device__ __host__ cplx1(float _re, float _img) : re(_re), img(_img) {};
    
        float re, img;
    };
    
    template<typename T>
    __global__ void memsetkernel(T* out, const T val, int N)
    {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
    
    #pragma unroll 8
        for(; tid < N; tid += stride) out[tid] = val;
    }
    
    template<typename T>
    __global__ void memcpykernel(const T* __restrict__ in, T* __restrict__ out, int N)
    {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
    
    #pragma unroll 8
        for(; tid < N; tid += stride) out[tid] = in[tid];
    }
    
    template<typename T>
    void memcpy(const T* in, T* out, int Nitems)
    {
        int nthreads = 1024;
        int nblocks = 13 * 2; // GTX 970 with 13 SM
    
        memcpykernel<T><<<nblocks, nthreads>>>(in, out, Nitems);
        cudaDeviceSynchronize();
    }
    
    template<typename T>
    void memset(T* in, const T value, int Nitems)
    {
        int nthreads = 1024;
        int nblocks = 13 * 2; // GTX 970 with 13 SM
    
        memsetkernel<T><<<nblocks, nthreads>>>(in, value, Nitems);
        cudaDeviceSynchronize();
    }
    
    
    int main(void)
    {
        const int Nitems = 1 << 24;
    
        typedef cplx0 fcomplex0;
        typedef cplx1 fcomplex1;
    
        {
            fcomplex0* in;
            fcomplex0* out;
            cudaMalloc((void **)&in, Nitems * sizeof(fcomplex0));
            cudaMalloc((void **)&out, Nitems * sizeof(fcomplex0));
    
            for(int i=0; i<10; i++) {
                memset<fcomplex0>(in, fcomplex0(1.0f,1.0f), Nitems);
                memcpy<fcomplex0>(in, out, Nitems);
            }
            cudaFree(in);
            cudaFree(out);
        }
    
        {
            fcomplex1* in;
            fcomplex1* out;
            cudaMalloc((void **)&in, Nitems * sizeof(fcomplex1));
            cudaMalloc((void **)&out, Nitems * sizeof(fcomplex1));
    
            for(int i=0; i<10; i++) {
                memset<fcomplex1>(in, fcomplex1(1.0f,1.0f), Nitems);
                memcpy<fcomplex1>(in, out, Nitems);
                cudaDeviceSynchronize();
            }
            cudaFree(in);
            cudaFree(out);
        }
    
        cudaDeviceReset();
        return 0;
    }
    

    Here we have two home-baked complex types, one with explicit alignment specifications and one without; otherwise they are identical. Putting them through naïve memcpy and memset kernels in this test harness allows us to inspect the code generation behaviour of the toolchain for each type and benchmark the performance.
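
    The PTX shown below can be reproduced by asking the compiler to emit PTX directly; something along these lines works (the source file name here is an assumption):

    $ nvcc -arch=sm_52 -ptx complex_types.cu -o complex_types.ptx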

    Firstly, the code. For the cplx0 class, which has explicit 8-byte alignment, the compiler emits vectorized loads and stores in both kernels:

    memcpykernel

        ld.global.nc.v2.f32     {%f5, %f6}, [%rd17];
        st.global.v2.f32        [%rd18], {%f5, %f6};
    

    memsetkernel

       st.global.v2.f32        [%rd11], {%f1, %f2};
    

    whereas for the cplx1 case, it does not:

    memcpykernel

        ld.global.nc.f32        %f1, [%rd16];
        ld.global.nc.f32        %f2, [%rd16+4];
        st.global.f32   [%rd15+4], %f2;
        st.global.f32   [%rd15], %f1;
    

    memsetkernel

        st.global.f32   [%rd11+4], %f2;
        st.global.f32   [%rd11], %f1;
    

    Looking at performance, there is a non-trivial difference for the memset case (CUDA 8 release toolkit, GTX 970 with Linux 367.48 driver):

    $ nvprof ./complex_types
    ==29074== NVPROF is profiling process 29074, command: ./complex_types
    ==29074== Profiling application: ./complex_types
    ==29074== Profiling result:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     33.04%  19.264ms        10  1.9264ms  1.9238ms  1.9303ms  void memcpykernel<cplx1>(cplx1 const *, cplx1*, int)
     32.72%  19.080ms        10  1.9080ms  1.9055ms  1.9106ms  void memcpykernel<cplx0>(cplx0 const *, cplx0*, int)
     19.15%  11.165ms        10  1.1165ms  1.1120ms  1.1217ms  void memsetkernel<cplx1>(cplx1*, cplx1, int)
     15.09%  8.7985ms        10  879.85us  877.67us  884.13us  void memsetkernel<cplx0>(cplx0*, cplx0, int)
    

    The Thrust templated complex type does not have an explicit alignment definition (although it potentially could via specialization, though that would somewhat defeat the purpose). So your only choice here is to either make your own version of the Thrust type with explicit alignment, or use another complex type which carries one (like the cuComplex type which CUBLAS and CUFFT use).
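
    As an illustration of the second option (the kernel below is a hypothetical example, not taken from the question), cuFloatComplex from cuComplex.h is a typedef of float2 and therefore already carries the 8-byte alignment that lets the compiler emit vectorized loads and stores:

        #include <cuComplex.h>

        __global__ void scalekernel(const cuFloatComplex* __restrict__ in,
                                    cuFloatComplex* __restrict__ out,
                                    float s, int N)
        {
            int tid = threadIdx.x + blockIdx.x * blockDim.x;
            if (tid < N) {
                // Loads and stores of cuFloatComplex compile to single 64-bit
                // transactions because float2 is declared __align__(8).
                cuFloatComplex v = in[tid];
                out[tid] = make_cuFloatComplex(s * cuCrealf(v), s * cuCimagf(v));
            }
        }

    The other route is to copy the Thrust type into your own code and add the __align__(8) specifier, as in the cplx0 class above, at the cost of maintaining a fork of the Thrust type.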