Tags: c++, dynamic-memory-allocation, simd, memory-alignment, micro-optimization

Allocating memory aligned buffers for SIMD; how does |16 give an odd multiple of 16, and why do it?


I am working on a C++ function to allocate multiple buffers in memory. The buffers have to be N-byte aligned, since the data they hold will be processed with various SIMD instruction sets (SSE, AVX, AVX-512, etc.).

In the Apple Core Audio Utility Classes online I found this piece of code:

void CABufferList::AllocateBuffers(UInt32 nBytes)
{
    if (nBytes <= GetNumBytes()) return;

    if (mABL.mNumberBuffers > 1) {
        // align successive buffers for Altivec and to take alternating
        // cache line hits by spacing them by odd multiples of 16
        nBytes = ((nBytes + 15) & ~15) | 16;
    }
    UInt32 memorySize = nBytes * mABL.mNumberBuffers;
    Byte *newMemory = new Byte[memorySize], *p = newMemory;
    memset(newMemory, 0, memorySize);   // get page faults now, not later

    AudioBuffer *buf = mABL.mBuffers;
    for (UInt32 i = mABL.mNumberBuffers; i--; ++buf) {
        if (buf->mData != NULL && buf->mDataByteSize > 0) {
            // preserve existing buffer contents
            memcpy(p, buf->mData, buf->mDataByteSize);
        }
        buf->mDataByteSize = nBytes;
        buf->mData = p;
        p += nBytes;
    }
    Byte *oldMemory = mBufferMemory;
    mBufferMemory = newMemory;
    mBufferCapacity = nBytes;
    delete[] oldMemory;
}

The code is pretty straightforward; however, there is one line that I just don't fully grasp:

nBytes = ((nBytes + 15) & ~15) | 16;

I understand it's aligning/quantizing the number of bytes to 16, however I don't understand why it's using a bitwise OR 16 at the end. The comment says: "to take alternating cache line hits by spacing them by odd multiples of 16". Excuse my thickness, but I still don't get it.

So I have three questions:

1) What does | 16; do exactly and why is it done?

2) Considering the context of memory allocation and data access, how and in what terms does | 16; improve the code? From the comments in the code I can guess it is related to cache access, but I don't understand the whole "alternating cache line hits" bit. How does spacing the memory allocation addresses by odd multiples of 16 improve on cache access?

3) Am I right thinking that the above function will only work correctly based on the assumption that the new operator will return at least 16-byte aligned memory? In C++ the new operator is defined as returning a pointer to storage with alignment suitable for any object with a fundamental alignment requirement, which might not necessarily be 16 bytes.


Solution

  • Disclaimer

    Based on the comment referring to Altivec, this is specific to the Power architecture, which I'm not familiar with. Also, the code is incomplete, but it looks like the allocated memory is organized in one or multiple adjacent buffers, and the size adjustment only works when there are multiple buffers. We don't know how data is accessed in these buffers. There will be a lot of assumptions in this answer, to the point that it may be totally incorrect. I'm posting it mostly because it's too large for a comment.

    Answer (sort of)

    I can see one possible advantage of the size modification. First, let's remember some details about Power architecture:

    • Altivec vector size is 16 bytes (128 bits)
    • Cache line size is 128 bytes

    Now, let's take an example where AllocateBuffers allocates memory for 4 buffers (i.e. mABL.mNumberBuffers is 4) and nBytes is 256. Let's see how these buffers are laid out in memory:

    | Buffer 1: 256+16=272 bytes | Buffer 2: 272 bytes | Buffer 3: 272 bytes | Buffer 4: 272 bytes |
    ^                            ^                     ^                     ^
    |                            |                     |                     |
    offset: 0                    272                   544                   816
    

    Notice the offset values and compare them against cache line boundaries. For simplicity, let's assume the memory is allocated at the cache line boundary. It doesn't really matter, as will be shown below.

    • Buffer 1 starts at offset 0, which is the beginning of a cache line.
    • Buffer 2 starts 16 bytes past the cache line boundary (which is at offset 2*128=256).
    • Buffer 3 starts 32 bytes past the cache line boundary (which is at offset 4*128=512).
    • Buffer 4 starts 48 bytes past the cache line boundary (which is at offset 6*128=768).

    Note how the offset from the nearest cache line boundary increments by 16 bytes. Now, if we assume that data in each of the buffers is accessed in 16-byte chunks, in the forward direction, in a loop, then the cache lines are fetched from memory in a rather specific order. Let's consider the middle of the loop (since at the beginning the CPU will have to fetch cache lines for the start of every buffer):

    • Iteration 5
      • Load from Buffer 1 at offset 5*16=80, we are still using the cache line that was fetched on previous iterations.
      • Load from Buffer 2 at offset 352, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 256, we're at its offset 96.
      • Load from Buffer 3 at offset 624, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 512, we're at its offset 112.
      • Load from Buffer 4 at offset 896, we hit a new cache line boundary and fetch a new cache line from memory.
    • Iteration 6
      • Load from Buffer 1 at offset 6*16=96, we are still using the cache line that was fetched on previous iterations.
      • Load from Buffer 2 at offset 368, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 256, we're at its offset 112.
      • Load from Buffer 3 at offset 640, we hit a new cache line boundary and fetch a new cache line from memory.
      • Load from Buffer 4 at offset 912, we are still using the cache line that was fetched on the last iteration. The cache line boundary is at offset 896, we're at its offset 16.
    • Iteration 7
      • Load from Buffer 1 at offset 7*16=112, we are still using the cache line that was fetched on previous iterations.
      • Load from Buffer 2 at offset 384, we hit a new cache line boundary and fetch a new cache line from memory.
      • Load from Buffer 3 at offset 656, we are still using the cache line that was fetched on the last iteration. The cache line boundary is at offset 640, we're at its offset 16.
      • Load from Buffer 4 at offset 928, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 896, we're at its offset 32.
    • Iteration 8
      • Load from Buffer 1 at offset 8*16=128, we hit a new cache line boundary and fetch a new cache line from memory.
      • Load from Buffer 2 at offset 400, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 384, we're at its offset 16.
      • Load from Buffer 3 at offset 672, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 640, we're at its offset 32.
      • Load from Buffer 4 at offset 944, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 896, we're at its offset 48.
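    The walkthrough above can be condensed into a small simulation (my own sketch, with the 128-byte line size and 16-byte loads assumed as in the answer): counting how many new cache lines each loop iteration touches shows at most one fetch per iteration with the padded 272-byte stride, versus four simultaneous fetches with an unpadded 256-byte stride.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <set>

    // Simulation sketch (function name is mine): walk 4 buffers spaced by
    // `stride` bytes in 16-byte steps and return the worst-case number of
    // new 128-byte cache lines touched per iteration, after the warm-up.
    std::uint32_t maxNewLinesPerIteration(std::uint32_t stride) {
        const std::uint32_t kLine = 128, kBuffers = 4, kIters = 16;
        std::set<std::uint32_t> fetched;  // cache line indices seen so far
        std::uint32_t worst = 0;
        for (std::uint32_t iter = 0; iter < kIters; ++iter) {
            std::uint32_t fresh = 0;
            for (std::uint32_t b = 0; b < kBuffers; ++b)
                if (fetched.insert((b * stride + iter * 16) / kLine).second)
                    ++fresh;  // this load pulled in a new cache line
            if (iter > 0)     // iteration 0 is the unavoidable warm-up
                worst = std::max(worst, fresh);
        }
        return worst;
    }

    int main() {
        // Padded stride (odd multiple of 16): at most 1 new line per iteration.
        assert(maxNewLinesPerIteration(272) == 1);
        // Unpadded stride (multiple of the line size): all four buffers cross
        // a line boundary on the same iteration.
        assert(maxNewLinesPerIteration(256) == 4);
    }
    ```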

    Note that the order in which new cache lines are fetched from memory does not depend on the order of accessing buffers within each loop iteration. Also, it does not depend on whether the whole memory allocation was aligned to a cache line boundary. Also note that if the buffer contents were accessed in reverse order then the cache lines would be fetched in reverse order, but still one at a time, in a regular pattern.

    This ordered cache line fetching may aid the hardware prefetcher in the CPU, so that when the next loop iteration executes, the required cache line is already pre-fetched. Without it, every 8th iteration of the loop would require 4 new cache lines in whatever order the buffers are accessed by the program, which could be interpreted as random access to memory and hamper the prefetcher. Depending on the loop complexity, this 4-cache-line fetch may not be hidden by out-of-order execution and may introduce a stall. This is less likely to happen when you only fetch up to 1 cache line per iteration.

    Another possible benefit is avoiding address aliasing. I don't know the cache organization of Power, but if nBytes is a multiple of the page size, using multiple buffers at once, with each buffer page-aligned, could result in lots of false dependencies and hamper store-to-load forwarding. However, the code applies the adjustment even when nBytes is not a multiple of the page size, so aliasing was probably not the main concern.

    3) Am I right thinking that the above function will only work correctly based on the assumption that the new operator will return at least 16-byte aligned memory? In C++ the new operator is defined as returning a pointer to storage with alignment suitable for any object with a fundamental alignment requirement, which might not necessarily be 16 bytes.

    Yes, C++ does not guarantee any particular alignment, other than that it is suitable for any object with a fundamental alignment requirement. C++17 adds support for dynamic allocation of over-aligned types.

    However, even with older C++ versions, every compiler also adheres to the target system ABI specification, which may specify alignment for memory allocations. In practice, on many systems malloc returns at least 16-byte aligned pointers and operator new uses memory returned by malloc or similar lower level API.
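    The point about practical guarantees can be illustrated with a small check (my own sketch; the helper name is hypothetical). The only portable lower bound is alignof(std::max_align_t); the common 16-byte behavior of malloc is an ABI detail, not a language rule.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>

    // Hypothetical helper: does malloc honor the fundamental alignment bound?
    // alignof(std::max_align_t) is the only alignment C and C++ promise here.
    bool mallocMeetsFundamentalAlignment() {
        void* p = std::malloc(64);
        if (p == nullptr) return false;
        bool ok =
            reinterpret_cast<std::uintptr_t>(p) % alignof(std::max_align_t) == 0;
        std::free(p);
        return ok;
    }

    int main() {
        // This must hold on any conforming implementation; whether the pointer
        // is also 16-byte aligned depends on the platform ABI.
        assert(mallocMeetsFundamentalAlignment());
    }
    ```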

    It's not portable though, and therefore not a recommended practice. If you require a particular alignment, either make sure you're compiling for C++17 or use specialized APIs, like posix_memalign.
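    For completeness, here is a sketch of the portable C++17 options (function names are mine): operator new with std::align_val_t for over-aligned storage, and std::aligned_alloc, which requires the size to be a multiple of the alignment. Note that std::aligned_alloc is missing on some platforms (notably MSVC), where posix_memalign or _aligned_malloc are the usual substitutes.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>
    #include <new>

    // C++17: request over-aligned storage from operator new directly.
    bool overalignedNewWorks(std::size_t align, std::size_t bytes) {
        void* p = ::operator new(bytes, std::align_val_t(align));
        bool ok = reinterpret_cast<std::uintptr_t>(p) % align == 0;
        ::operator delete(p, std::align_val_t(align));
        return ok;
    }

    // C11/C++17 alternative; bytes must be a multiple of align.
    bool alignedAllocWorks(std::size_t align, std::size_t bytes) {
        void* p = std::aligned_alloc(align, bytes);
        bool ok = p != nullptr && reinterpret_cast<std::uintptr_t>(p) % align == 0;
        std::free(p);
        return ok;
    }

    int main() {
        // 64 bytes covers a full cache line on most x86 CPUs.
        assert(overalignedNewWorks(64, 1024));
        assert(alignedAllocWorks(64, 1024));
    }
    ```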