Tags: cuda, memory-alignment, npp

How does CUDA's nppiMalloc... function guarantee alignment?


Something that has been confusing me for a while is the alignment requirement for allocated CUDA memory. I know that if the rows are properly aligned, accessing row elements will be much more efficient.

First a little background:

According to CUDA C Programming Guide (section 5.3.2):

Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.

My understanding is that, for a 2D interleaved array of type T (say, pixel values in R,G,B order), if numChannels * sizeof(T) is either 4, 8 or 16, then the array should be allocated using cudaMallocPitch if performance is a necessity. So far this has been working fine for me: I check numChannels * sizeof(T) before allocating a 2D array, and if it is 4, 8 or 16, I allocate it using cudaMallocPitch and everything works.

Now the question:

I've realized that NVIDIA's NPP library offers a family of allocator functions (nppiMalloc..., like nppiMalloc_32f_C1 and so on), and NVIDIA recommends using these functions for performance. My question is: how do these functions guarantee alignment? More specifically, what kind of math do they use to come up with a suitable value for the pitch?

For a single channel 512x512 pixel image (with float pixel values in the range [0, 1]) I've used both cudaMallocPitch and nppiMalloc_32f_C1.
cudaMallocPitch gave me a pitch value of 2048 while nppiMalloc_32f_C1 gave me 2560. Where does the latter number come from, and how exactly is it determined?

Why I care about this
I'm writing a synced memory class template for synchronizing values between the GPU and the CPU. This class is supposed to take care of allocating pitched memory (if possible) under the hood. Since I want this class to be interoperable with NVIDIA's NPP, I'd like to handle all allocations in a way that provides good performance for CUDA kernels as well as NPP operations.
My impression was that nppiMalloc was calling cudaMallocPitch under the hood, but it seems that I'm wrong.


Solution

  • An interesting question. However, there may be no definitive answer at all, for several reasons: The implementation of these methods is not publicly available, so one has to assume that NVIDIA uses some special tricks and tweaks internally. Moreover, the resulting pitch is not specified, so one has to assume that it might change between releases of CUDA/NPP. In particular, it is not unlikely that the actual pitch depends on the hardware version (the "Compute Capability") of the device on which the method is executed.

    Nevertheless, I was curious about this and wrote the following test:

    #include <stdio.h>
    #include <npp.h>
    
    // Allocates 1-row images of increasing width with the given NPP
    // allocator and prints, whenever the pitch changes, the largest
    // width that still used the previous pitch.
    template <typename T>
    void testStepBytes(const char* name, int elementSize, int numComponents, 
        T (*allocator)(int, int, int*))
    {
        printf("%s\n", name);
        int dw = 1; // width increment, in pixels
        int prevStepBytes = 0;
        for (int w=1; w<2050; w+=dw)
        {
            int stepBytes;
            void *p = allocator(w, 1, &stepBytes);
            nppiFree(p);
            if (stepBytes != prevStepBytes)
            {
                // The pitch changed at this width: report the previous
                // pitch together with the largest width (and the row
                // size in bytes) for which it was used
                printf("Stride %5d is used up to w=%5d (%6d bytes)\n", 
                    prevStepBytes, (w-dw), (w-dw)*elementSize*numComponents);
                prevStepBytes = stepBytes;
            }
        }
    }
    
    int main(int argc, char *argv[])
    {
        testStepBytes("nppiMalloc_8u_C1", 1, 1, &nppiMalloc_8u_C1);
        testStepBytes("nppiMalloc_8u_C2", 1, 2, &nppiMalloc_8u_C2);
        testStepBytes("nppiMalloc_8u_C3", 1, 3, &nppiMalloc_8u_C3);
        testStepBytes("nppiMalloc_8u_C4", 1, 4, &nppiMalloc_8u_C4);
    
        testStepBytes("nppiMalloc_16u_C1", 2, 1, &nppiMalloc_16u_C1);
        testStepBytes("nppiMalloc_16u_C2", 2, 2, &nppiMalloc_16u_C2);
        testStepBytes("nppiMalloc_16u_C3", 2, 3, &nppiMalloc_16u_C3);
        testStepBytes("nppiMalloc_16u_C4", 2, 4, &nppiMalloc_16u_C4);
    
        testStepBytes("nppiMalloc_32f_C1", 4, 1, &nppiMalloc_32f_C1);
        testStepBytes("nppiMalloc_32f_C2", 4, 2, &nppiMalloc_32f_C2);
        testStepBytes("nppiMalloc_32f_C3", 4, 3, &nppiMalloc_32f_C3);
        testStepBytes("nppiMalloc_32f_C4", 4, 4, &nppiMalloc_32f_C4);
    
        return 0;
    }
    

    The pitch (stepBytes) seemed to depend solely on the width of the image. So this program allocates memory for images of different types with increasing widths, and prints information about the maximum image widths that result in a particular stride. The intention was to derive a pattern or a rule - namely, the "kind of math" that you asked about.

    The results are ... somewhat confusing. For example, for the nppiMalloc_32f_C1 call, on my machine (CUDA 6.5, GeForce GTX 560 Ti, Compute Capability 2.1), it prints:

    nppiMalloc_32f_C1
    Stride     0 is used up to w=    0 (     0 bytes)
    Stride   512 is used up to w=  120 (   480 bytes)
    Stride  1024 is used up to w=  248 (   992 bytes)
    Stride  1536 is used up to w=  384 (  1536 bytes)
    Stride  2048 is used up to w=  504 (  2016 bytes)
    Stride  2560 is used up to w=  640 (  2560 bytes)
    Stride  3072 is used up to w=  768 (  3072 bytes)
    Stride  3584 is used up to w=  896 (  3584 bytes)
    Stride  4096 is used up to w= 1016 (  4064 bytes)
    Stride  4608 is used up to w= 1152 (  4608 bytes)
    Stride  5120 is used up to w= 1280 (  5120 bytes)
    Stride  5632 is used up to w= 1408 (  5632 bytes)
    Stride  6144 is used up to w= 1536 (  6144 bytes)
    Stride  6656 is used up to w= 1664 (  6656 bytes)
    Stride  7168 is used up to w= 1792 (  7168 bytes)
    Stride  7680 is used up to w= 1920 (  7680 bytes)
    Stride  8192 is used up to w= 2040 (  8160 bytes)
    

    This confirms that for an image with width=512, a stride of 2560 is used, whereas the expected stride of 2048 would only be used for images up to width=504.

    The numbers seemed a bit odd, so I ran another test for nppiMalloc_8u_C1 in order to cover all possible image line sizes (in bytes), with larger image sizes, and noticed a strange pattern: The first increase of the pitch (from 512 to 1024) occurred when the image was larger than 480 bytes, and 480 = 512 - 32. The next step (from 1024 to 1536) occurred when the image was larger than 992 bytes, and 992 = 480 + 512. The next step (from 1536 to 2048) occurred when the image was larger than 1536 bytes, and 1536 = 992 + 512 + 32. From there, it mostly ran in steps of 512, except at several sizes in between. The further steps are summarized here:

    nppiMalloc_8u_C1
    Stride      0 is used up to w=     0 (     0 bytes, delta     0)
    Stride    512 is used up to w=   480 (   480 bytes, delta   480)
    Stride   1024 is used up to w=   992 (   992 bytes, delta   512)
    Stride   1536 is used up to w=  1536 (  1536 bytes, delta   544)
    Stride   2048 is used up to w=  2016 (  2016 bytes, delta   480) \
    Stride   2560 is used up to w=  2560 (  2560 bytes, delta   544) | 4
    Stride   3072 is used up to w=  3072 (  3072 bytes, delta   512) |
    Stride   3584 is used up to w=  3584 (  3584 bytes, delta   512) /
    Stride   4096 is used up to w=  4064 (  4064 bytes, delta   480)     \
    Stride   4608 is used up to w=  4608 (  4608 bytes, delta   544)     |
    Stride   5120 is used up to w=  5120 (  5120 bytes, delta   512)     |
    Stride   5632 is used up to w=  5632 (  5632 bytes, delta   512)     | 8
    Stride   6144 is used up to w=  6144 (  6144 bytes, delta   512)     |
    Stride   6656 is used up to w=  6656 (  6656 bytes, delta   512)     |
    Stride   7168 is used up to w=  7168 (  7168 bytes, delta   512)     |
    Stride   7680 is used up to w=  7680 (  7680 bytes, delta   512)     /
    Stride   8192 is used up to w=  8160 (  8160 bytes, delta   480) \
    Stride   8704 is used up to w=  8704 (  8704 bytes, delta   544) |
    Stride   9216 is used up to w=  9216 (  9216 bytes, delta   512) |
    Stride   9728 is used up to w=  9728 (  9728 bytes, delta   512) |
    Stride  10240 is used up to w= 10240 ( 10240 bytes, delta   512) |
    Stride  10752 is used up to w= 10752 ( 10752 bytes, delta   512) |
    Stride  11264 is used up to w= 11264 ( 11264 bytes, delta   512) |
    Stride  11776 is used up to w= 11776 ( 11776 bytes, delta   512) | 16
    Stride  12288 is used up to w= 12288 ( 12288 bytes, delta   512) |
    Stride  12800 is used up to w= 12800 ( 12800 bytes, delta   512) |
    Stride  13312 is used up to w= 13312 ( 13312 bytes, delta   512) |
    Stride  13824 is used up to w= 13824 ( 13824 bytes, delta   512) |
    Stride  14336 is used up to w= 14336 ( 14336 bytes, delta   512) |
    Stride  14848 is used up to w= 14848 ( 14848 bytes, delta   512) |
    Stride  15360 is used up to w= 15360 ( 15360 bytes, delta   512) |
    Stride  15872 is used up to w= 15872 ( 15872 bytes, delta   512) /
    Stride  16384 is used up to w= 16352 ( 16352 bytes, delta   480)     \
    Stride  16896 is used up to w= 16896 ( 16896 bytes, delta   544)     |
    Stride  17408 is used up to w= 17408 ( 17408 bytes, delta   512)     |
    ...                                                                ... 32
    Stride  31232 is used up to w= 31232 ( 31232 bytes, delta   512)     |
    Stride  31744 is used up to w= 31744 ( 31744 bytes, delta   512)     |
    Stride  32256 is used up to w= 32256 ( 32256 bytes, delta   512)     /
    Stride  32768 is used up to w= 32736 ( 32736 bytes, delta   480) \
    Stride  33280 is used up to w= 33280 ( 33280 bytes, delta   544) |
    Stride  33792 is used up to w= 33792 ( 33792 bytes, delta   512) |
    Stride  34304 is used up to w= 34304 ( 34304 bytes, delta   512) |
    ...                                                            ... 64
    Stride  64512 is used up to w= 64512 ( 64512 bytes, delta   512) |
    Stride  65024 is used up to w= 65024 ( 65024 bytes, delta   512) /
    Stride  65536 is used up to w= 65504 ( 65504 bytes, delta   480)     \
    Stride  66048 is used up to w= 66048 ( 66048 bytes, delta   544)     |   
    Stride  66560 is used up to w= 66560 ( 66560 bytes, delta   512)     |
    Stride  67072 is used up to w= 67072 ( 67072 bytes, delta   512)     |
    ....                                                               ... 128
    Stride 130048 is used up to w=130048 (130048 bytes, delta   512)     |
    Stride 130560 is used up to w=130560 (130560 bytes, delta   512)     /
    Stride 131072 is used up to w=131040 (131040 bytes, delta   480) \
    Stride 131584 is used up to w=131584 (131584 bytes, delta   544) |
    Stride 132096 is used up to w=132096 (132096 bytes, delta   512) |
    ...                                                              | guess...
    

    There obviously is a pattern. The pitches are multiples of 512. At sizes of 512*2^n, with n being a whole number, there are some odd -32 and +32 offsets in the size limits that cause a larger pitch to be used.

    Maybe I'll have another look at this. I'm pretty sure that one could derive a formula covering this odd progression of the pitch. But again: this may depend on the underlying CUDA version, the NPP version, or even the Compute Capability of the card that is used.
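    As a first guess, the following rule reproduces every threshold in the tables above. This is a pure conjecture derived only from these measurements - NVIDIA's actual implementation is unknown and may well differ per CUDA/NPP version and device: round the row size up to the next multiple of 512, and if that multiple happens to be 512*2^n and the row comes within 32 bytes of it, add another 512.

    ```cpp
    #include <cassert>
    #include <cstdio>

    // Conjectured pitch rule, derived purely from the measurements above
    // (NOT NVIDIA's actual implementation, which is unknown): round the
    // row size in bytes up to the next multiple of 512; if that multiple
    // is 512 * 2^n and the row comes within 32 bytes of it, add another 512.
    static int conjecturedPitch(int rowBytes)
    {
        int pitch = ((rowBytes + 511) / 512) * 512; // round up to 512
        int k = pitch / 512;
        bool isPowerOfTwoTimes512 = (k > 0) && ((k & (k - 1)) == 0);
        if (isPowerOfTwoTimes512 && rowBytes > pitch - 32)
            pitch += 512; // the odd -32 boundary observed at 512*2^n
        return pitch;
    }

    int main()
    {
        // Spot-checks against the measured tables:
        assert(conjecturedPitch(512 * 4) == 2560); // 512 floats -> stride 2560
        assert(conjecturedPitch(504 * 4) == 2048); // 504 floats -> stride 2048
        assert(conjecturedPitch(480)  ==  512);    // last row size with pitch 512
        assert(conjecturedPitch(481)  == 1024);    // 480 = 512 - 32 boundary
        assert(conjecturedPitch(992)  == 1024);    // last row size with pitch 1024
        assert(conjecturedPitch(1536) == 1536);    // 1536 is not 512 * 2^n
        printf("all checks passed\n");
        return 0;
    }
    ```

    Whether this rule keeps holding beyond the widths I measured, or on other hardware, I cannot say.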

    And, just for completeness: it might also be the case that this strange pitch size is simply a bug in NPP. You never know.