Are word-aligned loads faster than unaligned loads on x64 processors?

Are loads of variables that are aligned on word boundaries faster than unaligned load operations on x86/64 (Intel/AMD 64 bit) processors?

A colleague of mine argues that unaligned loads are slow and should be avoided. He cites the padding of items to word boundaries in structs as a proof that unaligned loads are slow. Example:

struct A {
  char a;
  uint64_t b;
};

The struct A as usually a size of 16 bytes.

On the other hand, the documentation of the Snappy compressor states that Snappy assumes that "unaligned 32- and 64-bit loads and stores are cheap". According to the source code this is true of Intel 32 and 64-bit processors.

So: What is the truth here? If and by how much are unaligned loads slower? Under which circumstances?

Solution

Unaligned 32 and 64 bit access is NOT cheap.

I did tests to verify this. My results on Core i5 M460 (64 bit) were as follows: fastest integer type was 32 bit wide. 64 bit alignment was slightly slower but almost the same. 16 bit alignment and 8 bit alignment were both noticeably slower than both 32 and 64 bit alignment. 16 bit being slower than 8 bit alignment. The by far slowest form of access was non aligned 32 bit access that was 3.5 times slower than aligned 32 bit access (fastest of them) and unaligned 32 bit access was even 40% slower than unaligned 64 bit access.

Results: https://github.com/mkschreder/align-test/blob/master/results-i5-64bit.jpg?raw=true Source code: https://github.com/mkschreder/align-test