Don't fully understand custom-written 'memcpy' function in C

So I was browsing the Quake engine source code earlier today and stumbled upon some custom-written utility functions. One of them was 'Q_memcpy':

void Q_memcpy (void *dest, void *src, int count)
{
    int             i;

    if (( ( (long)dest | (long)src | count) & 3) == 0 )
    {
        count>>=2;
        for (i=0 ; i<count ; i++)
            ((int *)dest)[i] = ((int *)src)[i];
    }
    else
        for (i=0 ; i<count ; i++)
            ((byte *)dest)[i] = ((byte *)src)[i];
}

I understand the whole premise of the function, but I don't quite understand the reason for the bitwise OR between the source and destination addresses. So my questions are as follows:

Why does 'count' get used in the same bitwise arithmetic?
Why are the last two bits of that result checked?
What purpose does this whole check serve?

I'm sure it's something obvious but please excuse my ignorance because I haven't really delved into the more low level side of things when it comes to programming. I just find it interesting and want to learn more.

Solution

It first tests if all 3 arguments are divisible by 4. If - and only if - they all are, it proceeds with copying 4 bytes at a time.

In other words, written out in plain form the check would be:

if ((long) src % 4 == 0 && (long) dest % 4 == 0 && count % 4 == 0)
{
    count = count / 4;
    for (i = 0; i < count; i++)
        ((int *)dest)[i] = ((int *)src)[i];
}

I am not sure whether they tested their compiler, found that it generated bad code for the plain test, and therefore decided to write it in this more convoluted way. In any case, x | y | z guarantees that bit n is set in the result if it is set in any of x, y or z. Therefore, if (x | y | z) & 3 is 0, none of the numbers has either of its 2 lowest bits set, and so all of them are divisible by 4. count is part of the test because the int loop copies count / 4 whole ints; if count were not a multiple of 4, the remaining bytes would never be copied.
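To make the equivalence concrete, here is a minimal sketch that compares the bit-trick form with the spelled-out form. The function names (aligned_by_mask, aligned_by_mod, checks_agree) are made up for illustration and are not part of the Quake source; the sketch also uses unsigned types (uintptr_t, size_t) rather than the long and int casts in the original.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-trick form, as in Q_memcpy: OR the three values together and
   test the two lowest bits of the combined result. */
int aligned_by_mask(uintptr_t dest, uintptr_t src, size_t count)
{
    return ((dest | src | count) & 3) == 0;
}

/* Spelled-out form: each value is checked separately with %. */
int aligned_by_mod(uintptr_t dest, uintptr_t src, size_t count)
{
    return dest % 4 == 0 && src % 4 == 0 && count % 4 == 0;
}

/* Returns 1 if both forms agree for every combination of values 0..15,
   which covers every possible pattern of the two lowest bits. */
int checks_agree(void)
{
    uintptr_t d, s;
    size_t n;

    for (d = 0; d < 16; d++)
        for (s = 0; s < 16; s++)
            for (n = 0; n < 16; n++)
                if (aligned_by_mask(d, s, n) != aligned_by_mod(d, s, n))
                    return 0;
    return 1;
}
```

For example, aligned_by_mask(8, 16, 12) is true, while aligned_by_mask(8, 18, 12) is false, because 18 has bit 1 set and that bit survives the OR.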

Of course it would be rather silly to use now - the standard library memcpy in recent library implementations is almost certainly better than this.

Therefore, on recent compilers you can optimize all calls to Q_memcpy by switching them to memcpy. GCC, for example, can generate 64-bit or SIMD moves for memcpy, depending on the size of the area being copied.
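As a rough sketch of that switch, a thin wrapper could keep the old call shape while forwarding to the standard function. The name Q_memcpy_std is hypothetical, and the const qualifier on src and the guard against non-positive counts are assumptions added here, not something taken from the Quake source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical replacement that keeps Q_memcpy's (dest, src, int count)
   shape but lets the standard library do the copying. */
void Q_memcpy_std(void *dest, const void *src, int count)
{
    /* The original loops do nothing for count <= 0; keep that behaviour
       instead of converting a negative int to a huge size_t. */
    if (count > 0)
        memcpy(dest, src, (size_t)count);
}
```

Like memcpy itself, this assumes the two regions do not overlap. In practice you could also just call memcpy directly at each call site.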