SSE-copy, AVX-copy and std::copy performance

I'm tried to improve performance of copy operation via SSE and AVX:

    #include <immintrin.h>

    const int sz = 1024;
    float *mas = (float *)_mm_malloc(sz*sizeof(float), 16);
    float *tar = (float *)_mm_malloc(sz*sizeof(float), 16);
    float a=0;
    std::generate(mas, mas+sz, [&](){return ++a;});
    
    const int nn = 1000;//Number of iteration in tester loops    
    std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2, start3, end3; 
    
    //std::copy testing
    start1 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
        std::copy(mas, mas+sz, tar);
    end1 = std::chrono::system_clock::now();
    float elapsed1 = std::chrono::duration_cast<std::chrono::microseconds>(end1-start1).count();
    
    //SSE-copy testing
    start2 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=4, _tar+=4)
        {
           __m128 buffer = _mm_load_ps(_mas);
           _mm_store_ps(_tar, buffer);
        }
    }
    end2 = std::chrono::system_clock::now();
    float elapsed2 = std::chrono::duration_cast<std::chrono::microseconds>(end2-start2).count();
     
    //AVX-copy testing
    start3 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=8, _tar+=8)
        {
           __m256 buffer = _mm256_load_ps(_mas);
           _mm256_store_ps(_tar, buffer);
        }
    }
    end3 = std::chrono::system_clock::now();
    float elapsed3 = std::chrono::duration_cast<std::chrono::microseconds>(end3-start3).count();
    
    std::cout<<"serial - "<<elapsed1<<", SSE - "<<elapsed2<<", AVX - "<<elapsed3<<"\nSSE gain: "<<elapsed1/elapsed2<<"\nAVX gain: "<<elapsed1/elapsed3;
    
    _mm_free(mas);
    _mm_free(tar);

It works. However, while the number of iterations in tester-loops - nn - increases, performance gain of simd-copy decreases:

nn=10: SSE-gain=3, AVX-gain=6;

nn=100: SSE-gain=0.75, AVX-gain=1.5;

nn=1000: SSE-gain=0.55, AVX-gain=1.1;

Can anybody explain what is the reason of mentioned performance decrease effect and is it advisable to manually vectorization of copy operation?

Solution

The problem is that your test does a poor job to migrate some factors in the hardware that make benchmarking hard. To test this, I've made my own test case. Something like this:

for blah blah:
    sleep(500ms)
    std::copy
    sse
    axv

output:

SSE: 1.11753x faster than std::copy
AVX: 1.81342x faster than std::copy

So in this case, AVX is a bunch faster than std::copy. What happens when I change to test case to..

for blah blah:
    sleep(500ms)
    sse
    axv
    std::copy

Notice that absolutely nothing changed, except the order of the tests.

SSE: 0.797673x faster than std::copy
AVX: 0.809399x faster than std::copy

Woah! how is that possible? The CPU takes a while to ramp up to full speed, so tests that are run later have an advantage. This question has 3 answers now, including an 'accepted' answer. But only the one with the lowest amount of upvotes was on the right track.

This is one of the reasons why benchmarking is hard and you should never trust anyone's micro-benchmarks unless they've included detailed information of their setup. It isn't just the code that can go wrong. Power saving features and weird drivers can completely mess up your benchmark. One time i've measured an factor 7 difference in performance by toggling a switch in the bios that less than 1% of notebooks offer.