Why is memcpy faster on Linux (glibc) than on Windows?

I noticed that memcpy is faster on Linux than on Windows on the same hardware. I dual-boot the same machine (Intel i7-4770 CPU, 16 GB RAM) and run the same compiled C++ code on both systems. I'm benchmarking memcpy with this code:

#include <algorithm>  // std::max
#include <chrono>
#include <cstdlib>    // rand, srand, abs
#include <cstring>    // memcpy
#include <iostream>

typedef std::chrono::high_resolution_clock Clock;

int main() {
    const int mb = 300;
    int size = mb * 1024 * 1024 / sizeof(int);
    auto buffer = new int[size];
    srand(1);

    for(int i = 0; i < size; i++) {
        auto r = abs(rand()) % 2048;
        buffer[i] = std::max<int>(r, 1);
    }

    auto buffer2 = new int[size];

    const int repeats = 100;
    for (int j = 2; j < mb; j+=2) {
        auto start = Clock::now();

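        // Copy j MB in total, split into randomly sized chunks
        int chunkInts = j * 1024 * 1024 / sizeof(int);  // own name, so the outer size is not shadowed
        for (int i = 0; i < repeats; i++) {
            int offset = 0;
            while (offset < chunkInts) {
                // buffer[offset] holds a random length in [1, 2047] ints
                int copySize = buffer[offset];
                // Source and destination always start at the buffer heads;
                // only the length varies between calls
                memcpy(buffer2, buffer, copySize * sizeof(int));
                offset += copySize;
            }
        }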
        }

        auto end = Clock::now();
        auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start).count();
        // Time taken per MB, in nanoseconds
        std::cout << j << "," << diff / j / repeats  << std::endl;
    }
}

Linux execution is 10% faster on average: about 20 µs/MB on Linux versus 22 µs/MB on Windows. In both cases the code is compiled with gcc 10.2 and the -m64 -O3 -mavx flags. The project I'm working on is an OS database, and there I see an even bigger effect: memcpy and memset are around 30% faster on Linux for random-length copies of small buffers.

Any idea why memcpy on Windows behaves differently from Linux? I'd expect memcpy to be written in assembly language and to depend only on the CPU architecture, not on the OS.

Solution

memcpy is part of the standard C library and, as such, is provided by the operating system on which you run your code (or by an alternative provider if you use a different libc). For small copies of known size, GCC will often inline the operation to avoid the overhead of a function call, but for large or unknown sizes it will usually call the library function.
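To illustrate the difference between the two cases, here is a minimal sketch. The Header type and both function names are hypothetical and exist only for this example; what the compiler actually emits depends on the compiler version and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical 16-byte record, used only for illustration.
struct Header {
    unsigned char bytes[16];
};

// The size is a compile-time constant, so an optimizing build can
// typically expand this copy inline instead of calling memcpy.
void copy_fixed(Header& dst, const Header& src) {
    std::memcpy(&dst, &src, sizeof(Header));
}

// The size is only known at run time, so the compiler usually emits
// a call into the C library's memcpy.
void copy_runtime(void* dst, const void* src, std::size_t n) {
    assert((dst != nullptr && src != nullptr) || n == 0);
    std::memcpy(dst, src, n);
}
```

Compiling this with g++ -O3 -S and reading the generated assembly shows which of the two functions still contains a call to memcpy on a given toolchain.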

In this case, you're seeing that glibc and the C library used on Windows have different implementations, and that glibc's is the faster one. glibc provides several variants per platform and selects among them based on what works best for a given CPU, but Windows may not do so, or may have a less optimized implementation.

In the past, glibc has even taken advantage of the fact that memcpy's source and destination must not overlap, and copied backwards on some CPUs. That unfortunately broke some programs that did not comply with the standard, notably Adobe Flash Player. However, such an implementation was permissible and was indeed faster.
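Code that needs to copy between overlapping regions should use memmove, which is specified to handle overlap, rather than rely on the copy direction of a particular memcpy. A minimal sketch, where shift_right is a hypothetical helper written only for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: moves the first (size - shift) bytes of text
// to the right by shift positions, in place.
std::string shift_right(std::string text, std::size_t shift) {
    assert(shift <= text.size());
    const std::size_t count = text.size() - shift;
    // Source [0, count) and destination [shift, shift + count) overlap,
    // so memcpy would be undefined behavior here; memmove is well defined.
    std::memmove(&text[shift], &text[0], count);
    return text;
}
```

With memcpy in place of memmove, the result would depend on whether the implementation happens to copy forwards or backwards, which is exactly the kind of assumption that broke when glibc changed direction.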

Alternatively, the difference may come not from memcpy itself but from a different memory handling strategy on Windows. For example, it is common not to fault in all of the memory when it is first allocated. Linux, which in some cases will prefault subsequent pages, may be performing better here because of that optimization or a different one. If Windows has chosen not to do that, it could be because it complicates the code, or because it doesn't perform well on real-world use cases that are commonly run on Windows. What performs well in a synthetic benchmark may or may not match what performs well in the real world.
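One way to take page faults out of the measurement is to touch every page of a buffer before starting the clock. The sketch below assumes a 4096-byte page size rather than querying the OS, and prefault is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 4 KiB pages, typical on x86-64 but not queried from the OS.
constexpr std::size_t kAssumedPageSize = 4096;

// Writes one byte into each page of [data, data + bytes) so that the OS
// maps the pages before any timing starts. Overwrites those bytes with 0.
// Returns the number of pages touched.
std::size_t prefault(unsigned char* data, std::size_t bytes) {
    assert(data != nullptr || bytes == 0);
    volatile unsigned char* p = data;  // volatile keeps the stores from being optimized away
    std::size_t pages = 0;
    for (std::size_t i = 0; i < bytes; i += kAssumedPageSize) {
        p[i] = 0;
        ++pages;
    }
    return pages;
}
```

In the benchmark from the question this would apply to buffer2, which is allocated but never written before the timed loop, whereas buffer is already fully written by the initialization loop.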

Ultimately, this is a quality of implementation issue. The standard requires that the functions it specifies behave in a specified way, and doesn't specify performance characteristics. Some projects choose to include optimized memcpy implementations if performance of that function is very important to them. Others choose not to and prefer to advise users to choose a platform that best meets their needs, taking into account that some platforms may perform better than others.
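As a rough sketch of what a project-specific copy routine can look like, the hypothetical small_copy below special-cases lengths of 8 to 16 bytes with two possibly overlapping 8-byte moves and defers everything else to the library memcpy. This is an illustration of the idea, not code from glibc or any particular project, and whether it is actually faster has to be measured on the target platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical fast path for short copies; the regions must not overlap,
// as with memcpy.
inline void small_copy(void* dst, const void* src, std::size_t n) {
    assert((dst != nullptr && src != nullptr) || n == 0);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    if (n >= 8 && n <= 16) {
        // Load the first and last 8 bytes, then store both. The two
        // windows overlap when n < 16, which is harmless because they
        // carry identical bytes in the overlapping part.
        std::uint64_t head;
        std::uint64_t tail;
        std::memcpy(&head, s, 8);
        std::memcpy(&tail, s + n - 8, 8);
        std::memcpy(d, &head, 8);
        std::memcpy(d + n - 8, &tail, 8);
    } else {
        // All other lengths go to the C library implementation.
        std::memcpy(d, s, n);
    }
}
```

The fixed-size memcpy calls inside the fast path are ones an optimizing compiler can usually turn into plain loads and stores, so the branch avoids a library call for the short lengths it covers.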