Tags: performance, x86, micro-optimization

Why isn't MOVNTI slower, in a loop storing repeatedly to the same address?


section .text

%define n    100000

_start:
xor rcx, rcx

jmp .cond
.begin:
    movnti [array], eax

.cond:
    add rcx, 1 
    cmp rcx, n
    jl .begin


section .data
array times 81920 db "A"

According to perf, it runs at 1.82 instructions per cycle. I cannot understand why it's so fast. After all, the data has to be stored to memory (RAM), so it should be slow.

P.S. Is there any loop-carried dependency?

EDIT

section .text

%define n    100000

_start:
xor rcx, rcx

jmp .cond
.begin:
    movnti [array+rcx], eax

.cond:
    add rcx, 1 
    cmp rcx, n
    jl .begin


section .data
array times n dq 0

Now the loop takes 5 cycles per iteration. Why? After all, there is still no loop-carried dependency.


Solution

  • movnti can apparently sustain a throughput of one per clock when writing to the same address repeatedly.

    I think movnti keeps writing into the same fill buffer, and it's not getting flushed very often because there are no other loads or stores happening. (The write-combining article this idea comes from is about copying from WC video memory with SSE4.1 NT loads, as well as storing to normal memory with NT stores.)

    So the NT write-combining fill-buffer acts like a cache for multiple overlapping NT stores to the same address, and writes are actually hitting in the fill buffer instead of going to DRAM each time.

    DDR DRAM only supports burst-transfer commands. If every movnti produced a 4B write that actually was visible to the memory chips, there'd be no way it could run that fast. The memory controller either has to read/modify/write, or do an interrupted burst transfer, since there is no non-burst write command. See also Ulrich Drepper's What Every Programmer Should Know About Memory.
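    A rough back-of-the-envelope check makes the same point (just a sketch, assuming the ~2.4 GHz clock and the dual-channel DDR2-533 from the Core2 E6600 example below):

        1 movnti per cycle * 4 B * ~2.4e9 cycles/s      ~= 9.6 GB/s of independent 4-byte stores
        dual-channel DDR2-533 burst peak: 2 * 533 MT/s * 8 B  ~= 8.5 GB/s

    That 8.5 GB/s is only reachable with full burst transfers; scattered 4-byte read/modify/write cycles would achieve far less. So the stores must be combining inside the core long before they reach the memory controller.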

    We can further prove this is the case by running the test on multiple cores at once. Since they don't slow each other down at all, we can be sure that the writes are only infrequently making it out of the CPU cores and competing for memory cycles.


    The reason your experiment doesn't show your loop running at 4 instructions per clock (one iteration per cycle) is that you used such a tiny repeat count. The ~100k cycles the loop takes don't even dominate the process startup/exit overhead (which perf's timing includes).

    For example, on a Core2 E6600 (Merom/Conroe) with dual channel DDR2 533MHz, the total time including all process startup / exit stuff is 0.113846 ms. That's only 266,007 cycles.

    A more reasonable microbenchmark shows one iteration (one movnti) per cycle:

    global _start
    _start:
        xor ecx,ecx
    .begin:
        movnti  [array], eax
        dec     ecx
        jnz     .begin         ; 2^32 iterations
    
        mov eax, 60     ; __NR_exit
        xor edi,edi
        syscall         ; exit(0)
    
    section .bss
    array resb 81920
    

    (asm-link is a script I wrote)

    $ asm-link movnti-same-address.asm
    + yasm -felf64 -Worphan-labels -gdwarf2 movnti-same-address.asm
    + ld -o movnti-same-address movnti-same-address.o
    $ perf stat -e task-clock,cycles,instructions ./movnti-same-address 
    
     Performance counter stats for './movnti-same-address':
    
           1835.056710      task-clock (msec)         #    0.995 CPUs utilized          
         4,398,731,563      cycles                    #    2.397 GHz                    
        12,891,491,495      instructions              #    2.93  insns per cycle        
           1.843642514 seconds time elapsed
    

    Running in parallel:

    $ time ./movnti-same-address; time ./movnti-same-address & time ./movnti-same-address &
    
    real    0m1.844s / user    0m1.828s    # running alone
    [1] 12523
    [2] 12524
    peter@tesla:~/src/SO$ 
    real    0m1.855s / user    0m1.824s    # running together
    real    0m1.984s / user    0m1.808s
    # output compacted by hand to save space
    

    I expect perfect SMP scaling (except with hyperthreading), up to any number of cores. e.g. on a 10-core Xeon, 10 copies of this test could run at the same time (on separate physical cores), and each one would finish in the same time as if it was running alone. (Single-core turbo vs. multi-core turbo will also be a factor, though, if you measure wall-clock time instead of cycle counts.)


    zx485's uop count nicely explains why the loop isn't bottlenecked by the frontend or unfused-domain execution resources.

    However, this disproves his theory about the ratio of CPU to memory clocks having anything to do with it. Interesting coincidence, though, that the OP chose a count that happened to make the final total IPC work out that way.


    P.S. Is there any loop-carried dependency?

    Yes: the loop counter (a 1-cycle dependency chain through the add). BTW, you could have saved an insn by counting down towards zero with dec / jg instead of counting up and needing a cmp; see the sketch below.
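    For example, a count-down version of your first loop might look like this (just a sketch, reusing your array label and n; dec sets the flags that jg tests, so no separate cmp is needed):

        mov     rcx, n          ; count down from n instead of up from 0
    .begin:
        movnti  [array], eax
        dec     rcx             ; sets ZF/SF/OF
        jg      .begin          ; loop while rcx > 0: n iterations, one insn shorter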


    The write-after-write memory dependency isn't a "true" dependency in the normal sense, but it is something the CPU has to keep track of. The CPU doesn't "notice" that the same value is written repeatedly, so it has to make sure the last write is the one that "counts".

    This is called a write-after-write (WAW) hazard, also known as an output dependency. I think the term still applies when talking about memory, rather than registers.
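    As a tiny illustration (hypothetical values, not from the loop above): even when two back-to-back NT stores hit the same address, the CPU still has to track their program order so the younger one is what ends up visible:

        movnti  [array], eax    ; older store
        movnti  [array], ebx    ; younger store to the same address:
                                ; ebx's value is the one that has to "win",
                                ; even if eax and ebx happen to hold the same value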