x86-64, intel, memory-alignment

Why do aligned and non-aligned accesses have the same performance?


The Intel CPU manual (Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, section 8.1.1) says that "nonaligned data accesses will seriously impact the performance of the processor". I wrote a test to demonstrate this, but the result is that aligned and nonaligned data accesses have the same performance. Why? Could someone help? My code is shown below:

#include <iostream>
#include <stdint.h>
#include <time.h>
#include <chrono>
#include <string.h>
#include <stdlib.h> // for atoi()
using namespace std;

static inline int64_t get_time_ns()
{
    std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
    return a.count();
}
int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage:./test [01234567]" << endl;
        cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    cout << "offset = " << offset << endl;
    const uint64_t BUFFER_SIZE = 800000000;
    uint8_t* data_ptr = new uint8_t[BUFFER_SIZE];
    if (data_ptr == nullptr) {
        cout << "apply for memory failed" << endl;
        return 0;
    }
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 300;
    cout << "start" << endl;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        for (uint64_t j = offset; j <= BUFFER_SIZE - 8; j+= 8) { // align:offset = 0 nonalign: offset=1-7
            volatile auto tmp = *(uint64_t*)&data_ptr[j]; // read from memory
            //mov rax,QWORD PTR [rbx+rdx*1] // rbx+rdx*1 = 0x7fffc76fe019 
            //mov QWORD PTR [rsp+0x8],rax 
            ++tmp;
            //mov rcx,QWORD PTR [rsp+0x8] 
            //add rcx,0x1 
            //mov QWORD PTR [rsp+0x8],rcx
            *(uint64_t*)&data_ptr[j] = tmp; // write to memory
            //mov QWORD PTR [rbx+rdx*1],rcx
        }
    }
    auto end = get_time_ns();
    cout << "time elapse " << end - start << "ns" << endl;
    return 0;
}

Result:

offset = 0
start
time elapse 32991486013ns
offset = 1
start
time elapse 34089866539ns
offset = 2
start
time elapse 34011790606ns
offset = 3
start
time elapse 34097518021ns
offset = 4
start
time elapse 34166815472ns
offset = 5
start
time elapse 34081477780ns
offset = 6
start
time elapse 34158804869ns
offset = 7
start
time elapse 34163037004ns

Solution

  • On most modern x86 cores, aligned and misaligned accesses perform the same only if the access does not cross a specific internal boundary.

    The exact size of the internal boundary varies with the CPU's microarchitecture, but on Intel CPUs from the last decade the relevant boundary is the 64-byte cache line. That is, accesses which fall entirely within a 64-byte cache line perform the same regardless of whether they are aligned or not.

    If a (necessarily misaligned) access crosses a cache line boundary on an Intel chip, however, a penalty of about 2x is paid in both latency and throughput. The bottom-line impact of this penalty depends on the surrounding code and will often be much less than 2x and sometimes close to zero. This modest penalty may be much larger if a 4K page boundary is also crossed.

    Aligned accesses never cross these boundaries, so they can never suffer this penalty (a small sketch of the split predicate follows below).

    The broad picture is similar for AMD chips, though the relevant boundary has been smaller than 64 bytes on some recent chips, and it differs between loads and stores.

    I have included additional details in the load throughput and store throughput sections of a blog post I wrote.
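
    As a rule of thumb, an access splits a boundary exactly when it doesn't fit in the remainder of its current block. Here is a minimal sketch of that predicate, assuming the usual 64-byte lines and 4 KiB pages as fixed constants (a real tool would query the CPU):

    #include <cstdint>
    #include <cstdio>

    // An access of `size` bytes at address (or offset) `addr` crosses a
    // `block`-byte boundary iff it doesn't fit in the rest of its block.
    static bool splits(uint64_t addr, uint64_t size, uint64_t block) {
        return addr % block + size > block;
    }

    int main() {
        for (uint64_t off = 0; off < 64; ++off)
            if (splits(off, 8, 64)) // 64-byte cache line
                printf("offset %2llu: 8-byte access splits a line\n",
                       (unsigned long long)off);
        // Prints offsets 57 through 63 only. The same predicate with
        // block = 4096 detects the rarer, more expensive 4K page splits.
        return 0;
    }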

    Testing It

    Your test wasn't able to show the effect for several reasons:

    • The test didn't allocate aligned memory; you can't reliably cross a cache line boundary by offsetting into a region of unknown alignment (see the allocation sketch after this list).
    • You iterated 8 bytes at a time, so the majority of the accesses (7 out of 8) fall entirely within a cache line and have no penalty, leaving a small signal that is only detectable if the rest of the benchmark is very clean.
    • You used a large buffer size, which doesn't fit in any level of the cache. The split-line effect is only really obvious at the L1, or when splitting lines means you bring in twice the number of lines (e.g., random access). Since you access every line linearly in either scenario, you'll be limited by throughput from DRAM to the core regardless of splits: the split writes have plenty of time to complete while waiting for main memory.
    • You used a local volatile auto tmp and ++tmp, which creates a volatile on the stack and a lot of loads and stores to preserve volatile semantics: these are all aligned and wash out the effect you are trying to measure.
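
    On the first point: one way to get a heap buffer with known alignment is an over-aligned allocation. Here is a minimal sketch using C11/C++17 aligned_alloc (which requires the size to be a multiple of the alignment); the modified test below sidesteps the issue by using an alignas(64) array instead:

    #include <stdint.h>
    #include <stdlib.h>

    int main()
    {
        const size_t BUFFER_SIZE = 1 << 20; // must be a multiple of 64
        // 64-byte-aligned allocation: an offset from data_ptr now has a
        // known position relative to cache line boundaries.
        uint8_t* data_ptr = (uint8_t*)aligned_alloc(64, BUFFER_SIZE);
        if (data_ptr == nullptr)
            return 1;
        // ... run the benchmark against data_ptr + offset ...
        free(data_ptr);
        return 0;
    }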

    Here is my modification of your test, operating only in the L1 region and advancing 64 bytes at a time, so that every store is a split if any is:

    #include <iostream>
    #include <stdint.h>
    #include <time.h>
    #include <chrono>
    #include <string.h>
    #include <stdlib.h> // for atoi() and rand()
    #include <iomanip>
    
    using namespace std;
    
    static inline int64_t get_time_ns()
    {
        std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
        return a.count();
    }
    
    int main(int argc, char** argv)
    {
        if (argc < 2) {
            cout << "Usage:./test [01234567]" << endl;
            cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
            return 0;
        }
        uint64_t offset = atoi(argv[1]);
        const uint64_t BUFFER_SIZE = 10000;
        alignas(64) uint8_t data_ptr[BUFFER_SIZE];
        memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
        const uint64_t LOOP_CNT = 1000000;
        auto start = get_time_ns();
        for (uint64_t i = 0; i < LOOP_CNT; ++i) {
            uint64_t src = rand();
            for (uint64_t j = offset; j + 64 <= BUFFER_SIZE; j += 64) { // offset 0: aligned; offsets 57-63 make every store split a line
                memcpy(data_ptr + j, &src, 8);
            }
        }
        auto end = get_time_ns();
        cout << "time elapsed " << std::setprecision(2) << (end - start) / ((double)LOOP_CNT * BUFFER_SIZE / 64) <<
            "ns per write (rand:" << (int)data_ptr[rand() % BUFFER_SIZE] << ")" << endl;
        return 0;
    }
    

    Running this for all offsets from 0 to 64, I get:

    $ g++ test.cpp -O2 && for off in {0..64}; do printf "%2d :" $off && ./a.out $off; done
     0 :time elapsed 0.56ns per write (rand:0)
     1 :time elapsed 0.57ns per write (rand:0)
     2 :time elapsed 0.57ns per write (rand:0)
     3 :time elapsed 0.56ns per write (rand:0)
     4 :time elapsed 0.56ns per write (rand:0)
     5 :time elapsed 0.56ns per write (rand:0)
     6 :time elapsed 0.57ns per write (rand:0)
     7 :time elapsed 0.56ns per write (rand:0)
     8 :time elapsed 0.57ns per write (rand:0)
     9 :time elapsed 0.57ns per write (rand:0)
    10 :time elapsed 0.57ns per write (rand:0)
    11 :time elapsed 0.56ns per write (rand:0)
    12 :time elapsed 0.56ns per write (rand:0)
    13 :time elapsed 0.56ns per write (rand:0)
    14 :time elapsed 0.56ns per write (rand:0)
    15 :time elapsed 0.57ns per write (rand:0)
    16 :time elapsed 0.56ns per write (rand:0)
    17 :time elapsed 0.56ns per write (rand:0)
    18 :time elapsed 0.56ns per write (rand:0)
    19 :time elapsed 0.56ns per write (rand:0)
    20 :time elapsed 0.56ns per write (rand:0)
    21 :time elapsed 0.56ns per write (rand:0)
    22 :time elapsed 0.56ns per write (rand:0)
    23 :time elapsed 0.56ns per write (rand:0)
    24 :time elapsed 0.56ns per write (rand:0)
    25 :time elapsed 0.56ns per write (rand:0)
    26 :time elapsed 0.56ns per write (rand:0)
    27 :time elapsed 0.56ns per write (rand:0)
    28 :time elapsed 0.57ns per write (rand:0)
    29 :time elapsed 0.56ns per write (rand:0)
    30 :time elapsed 0.57ns per write (rand:25)
    31 :time elapsed 0.56ns per write (rand:151)
    32 :time elapsed 0.56ns per write (rand:123)
    33 :time elapsed 0.56ns per write (rand:29)
    34 :time elapsed 0.55ns per write (rand:0)
    35 :time elapsed 0.56ns per write (rand:0)
    36 :time elapsed 0.57ns per write (rand:0)
    37 :time elapsed 0.56ns per write (rand:0)
    38 :time elapsed 0.56ns per write (rand:0)
    39 :time elapsed 0.56ns per write (rand:0)
    40 :time elapsed 0.56ns per write (rand:0)
    41 :time elapsed 0.56ns per write (rand:0)
    42 :time elapsed 0.57ns per write (rand:0)
    43 :time elapsed 0.56ns per write (rand:0)
    44 :time elapsed 0.56ns per write (rand:0)
    45 :time elapsed 0.56ns per write (rand:0)
    46 :time elapsed 0.57ns per write (rand:0)
    47 :time elapsed 0.57ns per write (rand:0)
    48 :time elapsed 0.56ns per write (rand:0)
    49 :time elapsed 0.56ns per write (rand:0)
    50 :time elapsed 0.57ns per write (rand:0)
    51 :time elapsed 0.56ns per write (rand:0)
    52 :time elapsed 0.56ns per write (rand:0)
    53 :time elapsed 0.56ns per write (rand:0)
    54 :time elapsed 0.55ns per write (rand:0)
    55 :time elapsed 0.56ns per write (rand:0)
    56 :time elapsed 0.56ns per write (rand:0)
    57 :time elapsed 1.1ns per write (rand:0)
    58 :time elapsed 1.1ns per write (rand:0)
    59 :time elapsed 1.1ns per write (rand:0)
    60 :time elapsed 1.1ns per write (rand:0)
    61 :time elapsed 1.1ns per write (rand:0)
    62 :time elapsed 1.1ns per write (rand:0)
    63 :time elapsed 1ns per write (rand:0)
    64 :time elapsed 0.56ns per write (rand:0)
    

    Note that offsets 57 through 63 all take about 2x as long per write, and those are exactly the offsets where an 8-byte write crosses a 64-byte (cache line) boundary: for those offsets, offset % 64 + 8 exceeds 64.
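
    To confirm that the slow offsets really are line splits rather than some other effect, recent Intel cores expose split-access counters to Linux perf. Assuming your CPU has the Skylake-era event names (check perf list), something like:

    $ perf stat -e mem_inst_retired.split_stores ./a.out 57

    should count roughly one split store per inner-loop store at offsets 57 through 63, and essentially none at the other offsets.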