How to understand this code snippet in the bcopy.c of bionic?

I read the memcpy implementation in the http://androidxref.com/4.0.4/xref/bionic/libc/string/bcopy.c find following code is hard to understand, can anybody explain it?

 36 /*
 37  * sizeof(word) MUST BE A POWER OF TWO
 38  * SO THAT wmask BELOW IS ALL ONES
 39  */
 40 typedef long word;      /* "word" used for optimal copy speed */
 41 
 42 #define wsize   sizeof(word)
 43 #define wmask   (wsize - 1)
 44

...

        /*
 78          * Copy forward.
 79          */
 80         t = (long)src;  /* only need low bits */
 81         if ((t | (long)dst) & wmask) {
 82             /*
 83              * Try to align operands.  This cannot be done
 84              * unless the low bits match.
 85              */
 86             if ((t ^ (long)dst) & wmask || length < wsize)
 87                 t = length;
 88             else
 89                 t = wsize - (t & wmask);

What is mean of these bitwise operations? What is their intent?

Solution

The basic idea is to meet alignment constraints: each "word" to be copied one word at a time must be aligned on a "word" boundary.

Some CPUs have this as a fundamental constraint, that loads and stores must occur on a "natural" boundary. On older ARM processors the low bits of the address are actually ignored entirely, so that a load or store of two bytes from an odd-valued address have the same effect as:

short w = *(short *)(addr & ~1);

for instance. On some other CPUs an unaligned load or store results in a trap (MIPS and SPARC for instance), and still others will just do it, but with a performance penalty (x86).

So, suppose you're copying a large number of bytes (say, 4096 of them) from address 0x12345 to address 0x22345, and that the "word size" is 4 bytes. If we first copy three bytes, the addresses will now be 0x12348 and 0x22348. At this point we can copy just 1023 4-byte words, one word at a time, without tripping over any alignment issues. After that we'll have one remaining byte to copy, because 4096 = 3 + (4 * 1023) + 1.

This all makes the assumption that bytes are all addressed individually, even when loading and storing "words". This assumption is false on some machines: for instance, the old Data General MV10000 CPU would address "words" using "word addresses", which are essentially byte addresses divided by two. (It's thus impossible to address a "word" that spans two bytes: the word at location 0 has byte addresses 0 and 1 but word address 0; the word at location 1 has byte addresses 2 and 3; the word at location 2 has byte addresses 4 and 5; and so on.) On machines like this, you may need to use a different version of bcopy.c.

As @Alex noted, the XOR is just making sure that it is actually possible to align the two addresses. If you're copying from 0x12345 to 0x22345, it is; but if you're copying from 0x12345 to 0x22344, the two addresses will never line up.