Tags: performance, assembly, x86, x86-64

How do 3 random bytes make the code 10 times faster?


I'm reading Agner Fog's manual 2 - "Optimizing subroutines in assembly language" - section 9.2 - "Out of order execution".
I copied this example from it to check its behavior:

; first case = loop without extra bytes.
section .data
    ; using db instead of dd was a mistake which later led
    ; me to this question.
    mem1: db 23
    mem3: db 23
section .bss
    mem2: resd 1
    mem4: resd 1

section .text
    global _start
_start:
    mov rcx, 300000000

loop:
    ; waste time
    mov eax, dword [mem1]
    imul eax, 6
    mov [mem2], eax
    mov edx, dword [mem3]
    imul edx, 8
    mov [mem4], edx

    sub ecx, 1
    jnz loop

    mov eax, 60
    xor edi, edi
    syscall

Compiling and running:

% nasm -f elf64 outoforder.asm && ld outoforder.o -o outoforder && time ./outoforder
./outoforder  1.46s user 0.00s system 99% cpu 1.458 total

Several runs, and using EAX instead of EDX, give the same time. Now the part that baffles me: if I add at least three random bytes around mem1 and mem3, the code is 10 times faster!
For example:

; second case = loop + at least three random bytes around mem1 and mem3
section .data
    db 1
    mem1: db 23
    db 1
    mem3: db 23
    db 1

Compiling and running:

% nasm -f elf64 outoforder.asm && ld outoforder.o -o outoforder && time ./outoforder
./outoforder  0.13s user 0.00s system 99% cpu 0.130 total

No matter how or where I put them, I get the same result. A few more examples that produce the same result:

mem1: db 23
times 3 db 1
mem3: db 23

mem1: dw 23
mem3: dw 23
db 8

mem1: db 23
mem3: db 23
times 3 db 1

At first I asked some AIs and they told me the problem is alignment, but objdump shows that the addresses are not aligned, and as I said before, it doesn't matter where I put the bytes; the result is the same:

# disassembly of the second case.

% objdump -d outoforder

outoforder:     file format elf64-x86-64


Disassembly of section .text:

0000000000401000 <_start>:
  401000:       b9 00 a3 e1 11          mov    $0x11e1a300,%ecx

0000000000401005 <loop>:
  401005:       8b 04 25 01 20 40 00    mov    0x402001,%eax // not aligned
  40100c:       6b c0 06                imul   $0x6,%eax,%eax
  40100f:       89 04 25 08 20 40 00    mov    %eax,0x402008
  401016:       66 8b 04 25 03 20 40    mov    0x402003,%ax // not aligned
  40101d:       00 
  40101e:       66 6b c0 08             imul   $0x8,%ax,%ax
  401022:       66 89 04 25 0c 20 40    mov    %ax,0x40200c
  401029:       00 
  40102a:       83 e9 01                sub    $0x1,%ecx
  40102d:       75 d6                   jne    401005 <loop>
  40102f:       b8 3c 00 00 00          mov    $0x3c,%eax
  401034:       31 ff                   xor    %edi,%edi
  401036:       0f 05                   syscall

And I don't think these bytes are doing something weird with the cache. I did more tests to find some pattern, but I couldn't find anything useful; rather, I just got more confused.

Using DX instead of EDX eliminates the need for the 3 bytes, because adding them no longer makes any difference. The loop would be:

; third case = loop + partial register dx (extra bytes don't make any difference)
mov eax, dword [mem1]
imul eax, 6
mov [mem2], eax
mov dx, [mem3]
imul dx, 8
mov [mem4], dx

This makes my code faster than the first case but slower than the second case:

% nasm -f elf64 outoforder.asm && ld outoforder.o -o outoforder && time ./outoforder
./outoforder  0.32s user 0.00s system 99% cpu 0.321 total

And using AX instead of DX makes the code as fast as the second case:

; fourth case = loop + partial register ax (extra bytes don't make any difference)
mov eax, dword [mem1]
imul eax, 6
mov [mem2], eax
mov ax, [mem3]
imul ax, 8
mov [mem4], ax

% nasm -f elf64 outoforder.asm && ld outoforder.o -o outoforder && time ./outoforder
./outoforder  0.13s user 0.00s system 99% cpu 0.135 total

This also confuses me. According to what I have read, shouldn't it be the other way round? I.e., doesn't AX create a dependency on the upper 16 bits of EAX? And why does adding bytes make no difference now in either case (AX or DX)?

Deleting the first half of the loop (the first three instructions, which use EAX):

; fifth case = only the second half of the loop, with ax (using ax or dx doesn't make any difference)
mov ax, [mem3]
imul ax, 8
mov [mem4], ax

This takes the same time as the complete loop + partial register DX:

% nasm -f elf64 outoforder.asm && ld outoforder.o -o outoforder && time ./outoforder
./outoforder  0.31s user 0.00s system 99% cpu 0.309 total

And replacing AX by EAX:

; sixth case = only the second half of the loop, with eax (using eax or edx doesn't make any difference)
mov eax, [mem3]
imul eax, 8
mov [mem4], eax

% nasm -f elf64 outoforder.asm && ld outoforder.o -o outoforder && time ./outoforder
./outoforder  0.06s user 0.00s system 99% cpu 0.063 total

And again, adding bytes around mem3 or mem1 doesn't make any difference. I would greatly appreciate a detailed explanation, especially of the first example, which seems to me the most unbelievable.

processor: 11th Gen Intel(R) Core(TM) i7-11700F @ 2.50GHz
os: arch linux 6.6.10-arch1-1
arch: x86-64


Solution

  • The 3 extra bytes of db 1 plus the 2 bytes of db 23 make .data 5 bytes long, resulting in 3 bytes of padding between the end of .data and the start of .bss.

    It's these padding bytes that are key; even putting times 3 db 1 before mem1: makes it fast. But using section .bss align=1 makes it slow.

    This is sufficient to make the loads from mem1 and mem3 not overlap with the stores to mem2 and mem4, so each loop iteration is independent, limited only by throughput (the front-end, or 1 imul per cycle on port 1). Otherwise one or both loads partially overlap the store to mem2, so they have to wait for it (data dependency), and the latency is extra high because it's a store-forwarding stall (perf event ld_blocks.store_forward). (What are the costs of failed store-to-load forwarding on x86?)
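
    To make the overlap concrete, here's a rough sketch of the two layouts. The slow-case addresses are my guess, assuming the same 0x402000 base that your fast-case disassembly shows and the default 4-byte section alignment:

    ; Slow case (guessed addresses: same 0x402000 base, .bss rounded up to 4):
    ;   mem1 = 0x402000, mem3 = 0x402001, mem2 = 0x402004, mem4 = 0x402008
    ;   dword load  [mem3] reads  0x402001..0x402004
    ;   dword store [mem2] writes 0x402004..0x402007   ; byte 0x402004 overlaps -> store-forwarding stall
    ;
    ; Fast case (addresses from your disassembly):
    ;   mem1 = 0x402001, mem3 = 0x402003, mem2 = 0x402008, mem4 = 0x40200c
    ;   dword load  [mem3] reads  0x402003..0x402006   ; ends before mem2 -> no overlap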


    .data and .bss are contiguous within the same ELF segment in the executable. The linker implements a BSS by specifying a larger MemSiz than FileSiz for that segment, so the trailing bytes of the mapping are zero-filled. (Check readelf -a a.out.) But it pads if necessary to align the start of each section by 4, the default section alignment requested by NASM.
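
    If you want to poke at that directly, here's a sketch of the align=1 experiment mentioned above: it keeps the 3 extra bytes but removes the padding before .bss, which brings the overlap (and the slowdown) back:

    section .data                ; NASM requests align=4 for this section by default
        db 1
        mem1: db 23
        db 1
        mem3: db 23
        db 1
    section .bss align=1         ; .bss now starts right after the 5 data bytes,
        mem2: resd 1             ; so the dword load from mem3 overlaps this store again
        mem4: resd 1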

    The relevant perf event is ld_blocks.store_forward, like:

        perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,ld_blocks.store_forward ./foo

    If the count is far from zero, you have partial overlap between a store and a load that follows soon after.


    PS: your disassembly shows different code than your NASM source, using partial registers like loading into AX instead of EDX, so it has a dependency on the old value of RAX on CPUs other than P6-family (PPro through Nehalem).

    At first I assumed it wasn't a proper MCVE, but I could reproduce the slow vs. fast effect on my i7-6700k Skylake. So it's not a register dependency that's the problem there, it's memory.

    I guess that's your "case 2"? If you don't have an overlapping load/store, it can also be fast: you have one longer dependency chain within each iteration, but headed by an independent mov eax, [mem1] at the start of each iteration. The out-of-order exec window on modern x86 is deep enough to overlap much longer dep chains than that across iterations; see Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths.
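
    Annotating the non-overlapping version shows why that works (my reading of the dependencies, not uiCA output):

    mov eax, dword [mem1]    ; independent load: starts a fresh dep chain every iteration
    imul eax, 6              ; depends only on this iteration's load
    mov [mem2], eax          ; store; nothing later reloads from it
    mov edx, dword [mem3]    ; also independent of earlier iterations
    imul edx, 8
    mov [mem4], edx
    sub ecx, 1               ; the only loop-carried data dependency: 1-cycle sub through ECX
    jnz loop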

    Your later cases, which avoid the memory dependency / store-forwarding stall, bottleneck either on throughput of 1 imul per clock (port 1), or on partial-register writes creating a loop-carried dependency through RDX. (mov dx, [mem] merges a new value into the bottom of RDX, with a data dependency on the previous value.) So there's a 5-cycle loop-carried dependency: a 1-cycle ALU merge as part of the 16-bit load, plus a 4-cycle imul dx, dx, imm8.
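
    For comparison, the DX version's loop-carried chain (again my annotation, assuming the uop counts from uops.info discussed below):

    mov dx, [mem3]           ; merges into RDX: depends on last iteration's RDX (~1c ALU merge)
    imul dx, 8               ; really imul dx, dx, 8: 2 uops, ~4c from old DX to new DX
    mov [mem4], dx           ; store, off the critical path
                             ; => about 5 cycles per iteration through RDX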

    imul r16, r16, imm8 is 2 uops on Intel, because the destination could be a different 16-bit register. (https://uops.info/). So there's a port 1 uop for the multiply with 3 cycle latency, and a separate merge uop for any ALU port (0156). https://uops.info/html-lat/RKL/IMUL_IMMB_R16_R16_I8-Measurements.html has the latencies from different inputs to different outputs. e.g. the source register to FLAGS latency is 3 cycles, the destination to destination latency is 1 cycle. So the multiply can start before the merge destination is ready.
    imul dx, cx is a single uop with 3c latency because the destination is also a source, so it can just be an RMW that modifies the bottom of a 64-bit register. But imul dx, 8 is actually imul dx, dx, 8 in the machine code, and doesn't get special-cased; 16-bit operand-size isn't important enough to spend more transistors on. If code wanted to be fast, it would have done movzx edx, word [mem3] / imul edx, 8 (or much better shl edx, 3) / mov [mem4], dx.
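
    Spelled out as a loop body, that suggestion would look something like this (untested sketch):

    movzx edx, word [mem3]   ; zero-extending load writes all of RDX: no merge, no loop-carried dep
    shl   edx, 3             ; *8 as a shift (or imul edx, 8 if you really want a multiply)
    mov   [mem4], dx         ; store only the low 16 bits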

    This is the factor of 2.5 or 5 in performance between your latency-bound versions (about 5 cycles per iteration) and your throughput-bound versions (2 cycles per iteration with 2x imul, or 1 cycle with 1x imul). Rocket Lake's front-end is 5 fused-domain uops wide, and can sustain 2 stores per clock if they're to the same cache line. (sub ecx,1 / jne macro-fuse into a single uop.)
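
    As a rough sanity check of those numbers (assuming the CPU turbos to somewhere around 4.8 GHz for this loop): 0.32 s / 300M iterations ≈ 1.07 ns ≈ 5 cycles per iteration, matching the 5-cycle DX chain; 0.13 s ≈ 2 cycles per iteration (two imuls at 1 per clock); 0.06 s ≈ 1 cycle (one imul); and 1.46 s ≈ 23 cycles, consistent with each iteration paying a failed-store-forwarding penalty.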


    uiCA, the uops.info Code Analyzer

    BTW, you can copy/paste your asm into https://uica.uops.info/ (it has an option for NASM) and have it analyze dependencies and predict performance for Intel CPUs. (i7 11700F is Rocket Lake, one of the microarchitectures uiCA has a model for.) It uses measured uop data from https://uops.info/, which is generally more accurate than Agner Fog's instruction tables, and doesn't have copy/paste errors or typos from him doing his spreadsheet manually. And you can see the test loops used for any measurement by clicking on a number in the table, so if there is something surprising, you can see if the measurement failed to avoid a false dependency or something. Agner Fog's microarch PDF is still excellent, but for actual timing data I only look at his tables for older AMD CPUs and others that uiCA hasn't tested.

    Only paste the actual loop body; uiCA implicitly treats the block of asm as a loop body whether there's a jmp at the bottom or not, so you don't want it thinking you have a not-taken jcc in the loop and a syscall at the bottom.

    Also, it only analyzes a .o from assembling without linking (where all symbol addresses are the placeholder [0], so all stores and loads appear to go to the same address). So you'd have to manually fill in the numeric addresses from disassembly of your linked executable, like in your 2nd code block. Or do something like change [mem1] to [rdi+0], [mem3] to [rdi+1] etc. Actually I just tried it, and it doesn't detect partial overlap at all, only same address.
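
    A pasteable version of the loop, with the symbols replaced by arbitrary rdi offsets as suggested above (the offsets for mem2 and mem4 are my own picks; since uiCA only flags exact address matches, the exact values barely matter):

    loop:
        mov  eax, dword [rdi+0]     ; mem1
        imul eax, 6
        mov  [rdi+8], eax           ; mem2
        mov  edx, dword [rdi+1]     ; mem3
        imul edx, 8
        mov  [rdi+12], edx          ; mem4
        sub  ecx, 1
        jnz  loop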

    uiCA does work well for the cases where partial register stuff or just throughput is the bottleneck.

    Check the "dependencies" and/or "trace" options to have it show you a directed graph of dependencies between instructions with the critical path marked in red, or look at the trace to see which instructions in later iterations can or can't dispatch (be sent to an execution unit because their inputs are ready) soon after issue into the back-end. Those are the instructions with a data dependency on something earlier. Like for example mov dx, [mem3] depends on the old value of RDX, unlike mov edx, [mem3] which breaks dependencies, starting a new chain.

    So you either have a 5-cycle latency chain through a register (merge a load result + 4-cycle imul), or you have a 1 or 2 cycle throughput bottleneck on imul, or perhaps a front-end bottleneck.

    For example, uiCA analysis of a version with loads into DX, forming the 5 cycle loop-carried dep chain. If I hadn't looked at uiCA, I'd have still been assuming that imul dx, 8 was 3 cycle latency like I know imul edx, 8 is. But it shows the imul as 2 uops, which is where the extra cycle of latency came from. (The load/store addresses in that version would overlap if you uncommented the EAX part, but uiCA doesn't notice. :/)


    PS: you made this intentionally slow, but you could have picked a multiplier like 123 that couldn't be done more efficiently with shl dx, 3 or something. And you could have used default rel to get RIP-relative loads for more compact machine code.
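
    The default rel version is a one-line change (sketch); each absolute [mem] addressing mode then shrinks by a byte:

    default rel                   ; near the top of the file: plain [mem1] becomes RIP-relative

    mov eax, dword [mem1]         ; now encodes as 8b 05 <rel32>, 6 bytes,
                                  ; instead of 8b 04 25 <abs32>, 7 bytes as in your disassembly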

    Also, imul dx, [mem3], 123 would avoid the ALU merge uop that's part of mov dx, [mem3], having only the ALU merge uop from the imul. That would shorten your loop-carried dependency chain to 4 cycles. Of course it would be better to do a movzx load into EDX to avoid a loop-carried dependency entirely.
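
    In other words, something like one of these (sketches, not benchmarked):

    ; ~4-cycle loop-carried chain: only the imul's merge uop touches the old RDX
    imul dx, word [mem3], 123
    mov  [mem4], dx

    ; or, better, no loop-carried dependency at all
    movzx edx, word [mem3]
    imul  edx, edx, 123
    mov   [mem4], dx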