I am working on a C++ function to allocate multiple buffers in memory. The buffers have to be N-byte aligned since the data they hold will be processed with various types of SIMD instruction sets (SSE, AVX, AVX512, etc...)
In the Apple Core Audio Utility Classes online I found this piece of code:
void CABufferList::AllocateBuffers(UInt32 nBytes)
{
if (nBytes <= GetNumBytes()) return;
if (mABL.mNumberBuffers > 1) {
// align successive buffers for Altivec and to take alternating
// cache line hits by spacing them by odd multiples of 16
nBytes = ((nBytes + 15) & ~15) | 16;
}
UInt32 memorySize = nBytes * mABL.mNumberBuffers;
Byte *newMemory = new Byte[memorySize], *p = newMemory;
memset(newMemory, 0, memorySize); // get page faults now, not later
AudioBuffer *buf = mABL.mBuffers;
for (UInt32 i = mABL.mNumberBuffers; i--; ++buf) {
if (buf->mData != NULL && buf->mDataByteSize > 0) {
// preserve existing buffer contents
memcpy(p, buf->mData, buf->mDataByteSize);
}
buf->mDataByteSize = nBytes;
buf->mData = p;
p += nBytes;
}
Byte *oldMemory = mBufferMemory;
mBufferMemory = newMemory;
mBufferCapacity = nBytes;
delete[] oldMemory;
}
The code is pretty straight forward however there is one line that I just don't fully grasp:
nBytes = ((nBytes + 15) & ~15) | 16;
I understand it's aligning/quantizing the number of bytes to 16, however I don't understand why it's using a bitwise OR 16 at the end. The comment says: "to take alternating cache line hits by spacing them by odd multiples of 16". Excuse my thickness, but I still don't get it.
So I have three questions:
1) What does | 16;
do exactly and why is it done?
2) Considering the context of memory allocation and data access, how and in what terms does | 16;
improve the code? From the comments in the code I can guess it is related to cache access, but I don't understand the whole "alternating cache line hits" bit. How does spacing the memory allocation addresses by odd multiples of 16 improve on cache access?
3) Am I right thinking that the above function will only work correctly based on the assumption that the new operator will return at least 16-byte aligned memory? In C++ the new operator is defined as returning a pointer to storage with alignment suitable for any object with a fundamental alignment requirement, which might not necessarily be 16 bytes.
Based on the comment referring to Altivec, this is specific to the Power architecture, which I'm not familiar with. Also, the code is incomplete, but it looks like the allocated memory is organized in one or multiple adjacent buffers, and the size adjustment only works when there are multiple buffers. We don't know how data is accessed in these buffers. There will be a lot of assumptions in this answer, to the point that it may be totally incorrect. I'm posting it mostly because it's too large for a comment.
I can see one possible advantage of the size modification. First, let's remember some details about Power architecture:
Now, let's take an example that AllocateBuffers
allocates memory for 4 buffers (i.e. mABL.mNumberBuffers
is 4) and nBytes
is 256. Let's see how these buffers are laid out in memory:
| Buffer 1: 256+16=272 bytes | Buffer 2: 272 bytes | Buffer 3: 272 bytes | Buffer 4: 272 bytes |
^ ^ ^ ^
| | | |
offset: 0 272 544 816
Notice the offset values and compare them against cache line boundaries. For simplicity, let's assume the memory is allocated at the cache line boundary. It doesn't really matter, as will be shown below.
Note how the offset from the nearest cache line boundary increments by 16 bytes. Now, if we assume that data in each of the buffers will be accessed in 16-byte chunks, in forward direction, in a loop then the cache lines are fetched from memory in a rather specific order. Let's consider the middle of the loop (since in the beginning CPU will have to fetch cache lines for the beginning of every buffer):
Note that the order in which new cache lines are fetched from memory does not depend on the order of accessing buffers within each loop iteration. Also, it does not depend on whether the whole memory allocation was aligned to a cache line boundary. Also note that if buffer contents were accessed in reverse order then the cache lines would be fetched in forward order, but still in order.
This ordered cache line fetching may aid hardware prefercher in the CPU, so, when the next loop iteration is executed, the required cache line is already pre-fetched. Without it, every 8th iteration of the loop would require 4 new cache lines in whatever order the buffers are accessed by the program, which could be interpreted as random access to memory and hamper the prefetcher. Depending on the loop complexity, this 4 cache line fetch may not be hidden by the out-of-order execution model and introduce a stall. This is less likely to happen when you only fetch up to 1 cache line per iteration.
Another possible benefit is avoiding address aliasing. I don't know cache organization of Power, but if nBytes
is a multiple of a page size, using multiple buffers at once, when each buffer is page-aligned, could result in lots of false dependencies and hamper store-to-load forwarding. Though the code does the adjustment not just in case when nBytes
is a multiple of a page size, so aliasing probably was not the main concern.
- Am I right thinking that the above function will only work correctly based on the assumption that the new operator will return at least 16-byte aligned memory? In C++ the new operator is defined as returning a pointer to storage with alignment suitable for any object with a fundamental alignment requirement, which might not necessarily be 16 bytes.
Yes, C++ does not guarantee any particular alignment, other than it is suitable for storing any object of fundamental type. C++17 adds support for dynamic allocations for over-aligned types.
However, even with older C++ versions, every compiler also adheres to the target system ABI specification, which may specify alignment for memory allocations. In practice, on many systems malloc
returns at least 16-byte aligned pointers and operator new
uses memory returned by malloc
or similar lower level API.
It's not portable though, and therefore not a recommended practice. If you require a particular alignment, either make sure you're compiling for C++17 or use specialized APIs, like posix_memalign
.