Tags: python, cuda, gpgpu, nvcc

CUDA: Unaligned Memory Access Not Supported: What am I missing?


There are a few questions similar to this one, but this case is a bit odd: NVCC 3.1 chokes on the following code (with the unaligned memory access error in the title), while 3.2 and 4.0RC handle it fine:

float xtmp[MAT1];

for (i=0; i<MAT1; i++){
    xtmp[i]=x[p[i]]; //value that should be here
}

where p is passed to the function as a pointer (int *p) and comes from:

int p_pivot[MAT1],q_pivot[MAT1];
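To make the data flow concrete, here is a minimal host-side C sketch of the same gather (xtmp = Px). The function name `permute_gather` and the value MAT1 = 4 are assumptions for illustration only; the original code defines MAT1 elsewhere. The bounds assert is an added debugging check, not part of the original loop.

```c
#include <assert.h>

/* Assumed size for illustration; the original code defines MAT1 elsewhere. */
#define MAT1 4

/* Gather x through the permutation p: xtmp[i] = x[p[i]], i.e. xtmp = P x.
 * p arrives as a plain int pointer, e.g. &p_pivot[0] from the caller. */
void permute_gather(const float *x, const int *p, float *xtmp)
{
    for (int i = 0; i < MAT1; i++) {
        assert(p[i] >= 0 && p[i] < MAT1); /* every pivot index must be in range */
        xtmp[i] = x[p[i]];                /* indirect load through the pivot array */
    }
}
```

With p_pivot = {2, 0, 3, 1} and x = {10, 20, 30, 40}, the result is {30, 10, 40, 20}.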

For context, before the pivot arrays reach the 'top' function, they are populated as follows (I've cut out as much irrelevant code as I can for clarity):

...
for (i=0;i<MAT1;i++){
    ...
    p_pivot[i]=q_pivot[i]=i;
    ...
}
...

Beyond that, the only operations on the pivot arrays are three-step swaps through an integer temporary.
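The pivot setup described above can be sketched in plain C. The names `init_pivots` and `swap_pivot` and the value of MAT1 are hypothetical; only the identity initialisation and the three-step swap come from the question. Since a swap only exchanges existing entries, every value in p_pivot remains a valid index in [0, MAT1), which suggests the index values themselves are not what goes wrong.

```c
#include <assert.h>

#define MAT1 4 /* hypothetical size for illustration */

/* Start both permutations as the identity: p[i] = q[i] = i. */
void init_pivots(int *p_pivot, int *q_pivot)
{
    for (int i = 0; i < MAT1; i++) {
        p_pivot[i] = q_pivot[i] = i;
    }
}

/* Three-step swap of two pivot entries through an integer temporary.
 * Swaps only permute existing values, so every entry stays in [0, MAT1). */
void swap_pivot(int *pivot, int a, int b)
{
    assert(a >= 0 && a < MAT1 && b >= 0 && b < MAT1);
    int tmp = pivot[a];
    pivot[a] = pivot[b];
    pivot[b] = tmp;
}
```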

After all that, p_pivot is passed to the 'top' function as &p_pivot[0].
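Passing &p_pivot[0] is equivalent to passing p_pivot itself, because an array expression decays to a pointer to its first element. A small C sketch, where `top` is a hypothetical stand-in that only reads through the pointer it receives:

```c
#include <assert.h>

#define MAT1 4 /* hypothetical size for illustration */

/* Hypothetical stand-in for the 'top' function: it only ever sees an int pointer. */
int top(const int *p)
{
    return p[MAT1 - 1];
}

/* &p_pivot[0] and the decayed array p_pivot are the same address. */
int call_top(void)
{
    int p_pivot[MAT1] = {3, 1, 0, 2};
    assert(&p_pivot[0] == p_pivot); /* same address either way */
    return top(&p_pivot[0]);
}
```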

For anyone looking for more detail, the code is here, and the only change that should be needed to switch between 3.2/4.0 and earlier versions is replacing cudaDeviceSynchronize(); with cudaThreadSynchronize();. This is my dirty, dirty experimental code, so please don't judge me! :D

As noted, all of the above works fine with newer versions of NVCC, and I'm working on getting those installed on the machine in question, but I'd like to understand what I'm missing.

It must be the indirect array lookup (x[p[i]]) that's causing the issue, but I don't understand why.
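One way to reason about the error message: for a correctly aligned float array, the address of x[k] is x plus k * sizeof(float) bytes, so it is float-aligned for every integer k. An out-of-range index would give an out-of-bounds address, but still an aligned one. A host-side C sketch of that arithmetic (the helper names `is_aligned` and `element_aligned` are hypothetical, and this is plain C rather than what nvcc 3.1 generates):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if addr is a multiple of align (align must be a power of two). */
int is_aligned(const void *addr, size_t align)
{
    return ((uintptr_t)addr & (align - 1)) == 0;
}

/* x + k lies k * sizeof(float) bytes past x, and sizeof(float) is a multiple
 * of float's alignment, so an aligned x gives an aligned x + k for any k. */
int element_aligned(const float *x, int k)
{
    return is_aligned(x + k, _Alignof(float));
}
```

If the indexing were computed correctly, an unaligned access could not arise from the values in p alone, which is consistent with a problem in the generated address arithmetic.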


Solution

  • That looks like a compiler bug to me. The following works with nvcc 3.1 on 64-bit platforms:

    float xtmp[MAT1];
    //Swap rows (x=Px)
    for (i=0; i<MAT1; i++){
        int idx = p[i];
        xtmp[i]=x[idx]; //value that should be here
    }
    

    My guess is that something in the implicit int-to-size_t conversion of the index is breaking. It doesn't fail with any of the newer CUDA versions I have tried.
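To see where the suspected conversion enters, here is a host-side C sketch (plain C, not nvcc 3.1 output). In C, x[k] means *(x + k), and on a 64-bit platform the 32-bit int index has to be widened to pointer width before it is scaled by sizeof(float). The helpers `offset_signed`, `offset_via_size_t` and `gather_with_local` are hypothetical names for illustration. For the non-negative indices a pivot array holds, both widenings give the same byte offset, so this only shows where the conversion happens; it does not prove it is the cause.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of x[k] when the int index is sign-extended, as C requires. */
ptrdiff_t offset_signed(int k)
{
    return (ptrdiff_t)k * (ptrdiff_t)sizeof(float);
}

/* Byte offset if the same index went through size_t (unsigned) instead.
 * Identical for k >= 0, wraps around to a huge value for k < 0. */
size_t offset_via_size_t(int k)
{
    return (size_t)k * sizeof(float);
}

/* Host-side version of the workaround: load the index into a local int
 * first, then use it. Semantically identical to xtmp[i] = x[p[i]]. */
void gather_with_local(const float *x, const int *p, float *xtmp, int n)
{
    for (int i = 0; i < n; i++) {
        int idx = p[i];
        assert(idx >= 0 && idx < n); /* pivot indices should always be in range */
        xtmp[i] = x[idx];
    }
}
```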