So I was browsing the Quake engine source code earlier today and stumbled upon some hand-written utility functions. One of them was 'Q_memcpy':
    void Q_memcpy (void *dest, void *src, int count)
    {
        int i;

        if (( ( (long)dest | (long)src | count) & 3) == 0 )
        {
            count>>=2;
            for (i=0 ; i<count ; i++)
                ((int *)dest)[i] = ((int *)src)[i];
        }
        else
            for (i=0 ; i<count ; i++)
                ((byte *)dest)[i] = ((byte *)src)[i];
    }
I understand the whole premise of the function, but I don't quite understand the reason for the bitwise OR between the source and destination addresses. So my question is: what is that bitwise OR for?
I'm sure it's something obvious, but please excuse my ignorance: I haven't really delved into the lower-level side of programming. I just find it interesting and want to learn more.
It first tests if all 3 arguments are divisible by 4. If - and only if - they all are, it proceeds with copying 4 bytes at a time.
That is, written out without the bit trick, the first branch would be:
    if ((long) src % 4 == 0 && (long) dest % 4 == 0 && count % 4 == 0)
    {
        count = count / 4;
        for (i = 0; i < count; i++)
            ((int *)dest)[i] = ((int *)src)[i];
    }
I am not sure whether they tested their compiler and found that it generated poor code for even such a simple test, and therefore decided to write it in this convoluted way. In any case, x | y | z guarantees that bit n is set in the result if it is set in any of x, y or z. Therefore, if (x | y | z) & 3 results in 0, none of the numbers had either of the 2 lowest bits set, and so all of them are divisible by 4.
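To make the equivalence concrete, here is a small sketch that compares the mask form with the three separate remainder tests. The helper names (all_multiples_of_4_mask, all_multiples_of_4_mod, forms_agree) are made up for illustration and are not from the Quake source. I also use uintptr_t where the original casts to long, which is an assumption on my part about how you would write this today.

```c
#include <stdint.h>

/* Hypothetical helper: the mask test from Q_memcpy, with uintptr_t in
   place of the original's (long) casts. */
static int all_multiples_of_4_mask(uintptr_t x, uintptr_t y, uintptr_t z)
{
    /* OR collects every set bit; & 3 keeps only the two lowest bits. */
    return ((x | y | z) & 3) == 0;
}

/* The same condition written as three separate remainder tests. */
static int all_multiples_of_4_mod(uintptr_t x, uintptr_t y, uintptr_t z)
{
    return x % 4 == 0 && y % 4 == 0 && z % 4 == 0;
}

/* Returns 1 if both forms agree for every combination of values 0..15. */
static int forms_agree(void)
{
    for (uintptr_t x = 0; x < 16; x++)
        for (uintptr_t y = 0; y < 16; y++)
            for (uintptr_t z = 0; z < 16; z++)
                if (all_multiples_of_4_mask(x, y, z) !=
                    all_multiples_of_4_mod(x, y, z))
                    return 0;
    return 1;
}
```

Only the two lowest bits matter for the test, so checking the values 0 to 15 already covers every possible combination of low bits.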
Of course it would be rather silly to use this now: the standard library memcpy in recent library implementations is almost certainly better.
Therefore, with recent compilers you can optimize all calls to Q_memcpy by switching them to memcpy. GCC can generate things like 64-bit or SIMD moves for memcpy, depending on the size of the area to be copied.
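If you cannot touch every call site, one possible way to make the switch is to keep the Q_memcpy name and forward to memcpy. This is only a sketch under assumptions: it assumes the two regions never overlap (memcpy requires that), and the count > 0 guard is my addition to keep the original's behaviour of doing nothing for a zero or negative count.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical drop-in replacement: same name and signature as the
   function in the question, but the copy is delegated to memcpy.
   Assumes dest and src do not overlap. */
void Q_memcpy(void *dest, void *src, int count)
{
    /* The original loops simply did not run for count <= 0. */
    if (count > 0)
        memcpy(dest, src, (size_t)count);
}
```

Alignment no longer needs a special case here, since memcpy handles aligned and unaligned buffers itself.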