Search code examples
cperformanceopenmpxeon-phi

Padding array manually


I am trying to understand 9 point stencil's algorithm from this book , the logic is clear to me , but the calculation of WIDTHP macro is what i am unable to understand, here is the breif code (original code is more than 300 lines length!!):

#define PAD64 0
#define WIDTH 5900
#if PAD64
#define WIDTHP ((((WIDTH*sizeof(REAL))+63)/64)*(64/sizeof(REAL)))
#else
#define WIDTHP WIDTH
#endif
#define HEIGHT 10000

REAL *fa = (REAL *)malloc(sizeof(REAL)*WIDTHP*HEIGHT);
REAL *fb = (REAL *)malloc(sizeof(REAL)*WIDTHP*HEIGHT);

original array is 5900 X 10000, but if i define PAD64 , the array becomes 5915.75 X 10000

Though so far i can guess that the author is trying to align & pad array to 64 byte boundary. But array returned by malloc is usually aligned(& padded) , also, the posix_memalign gives you a chunk of memory that is guaranteed to have the requested alignment , we can also use

__attribute__((align(64)))

what impact does this WIDTHP can make on my code's performance?


Solution

  • I was going to put this in as a comment to unwind's answer because he's right. But perhaps I can explain more clearly, albeit in more characters than will fit in a comment.

    When I do the math, I get 5904 reals, which is 23616 bytes, which is 396 cache lines for 64 byte cache lines. It is the bytes, rather than the number of elements which must be a multiple of 64.

    As to why you want to pad the value of width, lets look at a smaller example. Let's pretend we had a "cache line" that holds 10 letter and that we have an "array" with a width of 8 letters and height of 4. Now since our hypothetical array is in C and C is row major, the array will look something like this: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD

    but what does it look like when it is arranged in cache lines, since those are 10 letters long: AAAAAAAABB BBBBBBCCCC CCCCDDDDDD DD

    Not good. Only the first row of the array is aligned. But if we pad width by two spaces, we get this in cache: AAAAAAAA__ BBBBBBBB__ CCCCCCCC__ DDDDDDDD__

    which is what we want. Now we can have a nested loop like

    for i = 1 to height
       for j = 1 to width
    

    and know that every time we start to work on the j loop, the data we need will be aligned.

    Oh, and yes, they really should do something to make sure that the first element of the array is aligned. 'attribute((align(64)))' won't work because the arrays are being allocated dynamically but they could have used posix_memalign instead of malloc.