Why does the same for loop runs faster in the second time?

Initially I was comparing performance of built-in D arrays and plain pointers, but I've ended up with a different issue. For some reason if I run two identical for loops one after another, the second one always completes faster.

Here is the code:

import std.stdio : writeln;
import std.datetime : StopWatch;
import core.stdc.stdlib : malloc, free;

void main()
{
    immutable N = 1_000_000_000;
    StopWatch sw;

    uint* ptr = cast(uint*)malloc(uint.sizeof * N);

    sw.start();
    for (uint i = 0; i < N; ++i)
        ptr[i] = 1;
    sw.stop();
    writeln("the first for loop time: ", sw.peek().msecs(), " msecs");
    sw.reset();

    sw.start();
    for (uint i = 0; i < N; ++i)
        ptr[i] = 2;
    sw.stop();
    writeln("the second for loop time: ", sw.peek().msecs(), " msecs");
    sw.reset();

    free(ptr);
}

After compiling and running it with dmd -release -O -noboundscheck -inline test.d -of=test && ./test it prints:

the first for loop time: 1253 msecs
the second for loop time: 357 msecs

I was not sure whether this is something related to D or dmd so I've rewritten this code in C++:

#include <iostream>
#include <chrono>

int main()
{
    const unsigned int N = 1000000000;

    unsigned int* ptr = (unsigned int*)malloc(sizeof(unsigned int) * N);

    auto start = std::chrono::high_resolution_clock::now();
    for (uint i = 0; i < N; ++i)
        ptr[i] = 1;
    auto finish = std::chrono::high_resolution_clock::now();
    auto milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
    std::cout << "the first for loop time: " << milliseconds.count() << " msecs" << std::endl;

    start = std::chrono::high_resolution_clock::now();
    for (uint i = 0; i < N; ++i)
        ptr[i] = 2;
    finish = std::chrono::high_resolution_clock::now();
    milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
    std::cout << "the second for loop time: " << milliseconds.count() << " msecs" << std::endl;

    free(ptr);
}

and g++ -O3 test.cpp -o test && ./test gives similar output:

the first for loop time: 1029 msecs
the second for loop time: 349 msecs

The result is the same every time I run this code. Allocated data is too big to be cached. There is no branching points so no branching prediction issues should be involved. The memory is accessed at the same direct order at both loops so I guess this should not be related to the memory layout.

So why does the second one runs faster than the first one?

Solution

Because uint* ptr = cast(uint*)malloc(uint.sizeof * N); does not allocate memory until you do for loop over many elements. You can test it:

import core.stdc.stdlib : malloc, free;

void main()
{
    immutable N = 1_000_000_000;
    uint* ptr = cast(uint*)malloc(uint.sizeof * N);

    foreach (_; 0 .. 100)
    for (uint i = 0; i < N; ++i)
        ptr[N-1] = 1;

    // until this point almost no memory is allocated
    for (uint i = 0; i < N; ++i)
        ptr[i] = 2;

    free(ptr);
}

Update @Eljay allready explain this in comments