c++ performance sse simd avx

SSE-copy, AVX-copy and std::copy performance


I tried to improve the performance of a copy operation using SSE and AVX:

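    #include <immintrin.h>
    #include <algorithm>  // std::copy, std::generate
    #include <chrono>
    #include <iostream>

    const int sz = 1024;
    // 32-byte alignment: _mm256_load_ps/_mm256_store_ps require it (16 is only enough for SSE)
    float *mas = (float *)_mm_malloc(sz*sizeof(float), 32);
    float *tar = (float *)_mm_malloc(sz*sizeof(float), 32);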
    float a=0;
    std::generate(mas, mas+sz, [&](){return ++a;});
    
    const int nn = 1000; // number of iterations in each test loop
    std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2, start3, end3; 
    
    //std::copy testing
    start1 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
        std::copy(mas, mas+sz, tar);
    end1 = std::chrono::system_clock::now();
    float elapsed1 = std::chrono::duration_cast<std::chrono::microseconds>(end1-start1).count();
    
    //SSE-copy testing
    start2 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=4, _tar+=4)
        {
           __m128 buffer = _mm_load_ps(_mas);
           _mm_store_ps(_tar, buffer);
        }
    }
    end2 = std::chrono::system_clock::now();
    float elapsed2 = std::chrono::duration_cast<std::chrono::microseconds>(end2-start2).count();
     
    //AVX-copy testing
    start3 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=8, _tar+=8)
        {
           __m256 buffer = _mm256_load_ps(_mas);
           _mm256_store_ps(_tar, buffer);
        }
    }
    end3 = std::chrono::system_clock::now();
    float elapsed3 = std::chrono::duration_cast<std::chrono::microseconds>(end3-start3).count();
    
    std::cout<<"serial - "<<elapsed1<<", SSE - "<<elapsed2<<", AVX - "<<elapsed3<<"\nSSE gain: "<<elapsed1/elapsed2<<"\nAVX gain: "<<elapsed1/elapsed3;
    
    _mm_free(mas);
    _mm_free(tar);

It works. However, as the number of iterations in the test loops (nn) increases, the performance gain of the SIMD copies decreases:

nn=10: SSE-gain=3, AVX-gain=6;

nn=100: SSE-gain=0.75, AVX-gain=1.5;

nn=1000: SSE-gain=0.55, AVX-gain=1.1;

Can anybody explain the reason for this decrease in performance gain, and is it advisable to vectorize a copy operation manually?


Solution

  • The problem is that your test does a poor job of mitigating some hardware factors that make benchmarking hard. To test this, I made my own test case, something like this:

    for blah blah:
        sleep(500ms)
        std::copy
        sse
        avx
    

    output:

    SSE: 1.11753x faster than std::copy
    AVX: 1.81342x faster than std::copy
    

    So in this case, AVX is a bunch faster than std::copy. What happens when I change the test case to this?

    for blah blah:
        sleep(500ms)
        sse
        avx
        std::copy
    

    Notice that absolutely nothing changed, except the order of the tests.

    SSE: 0.797673x faster than std::copy
    AVX: 0.809399x faster than std::copy
    

    Whoa! How is that possible? The CPU takes a while to ramp up to full speed, so tests that are run later have an advantage. This question has 3 answers now, including an 'accepted' answer. But only the one with the fewest upvotes was on the right track.

    This is one of the reasons why benchmarking is hard and you should never trust anyone's micro-benchmarks unless they've included detailed information about their setup. It isn't just the code that can go wrong. Power-saving features and weird drivers can completely mess up your benchmark. One time I measured a factor-of-7 difference in performance by toggling a switch in the BIOS that fewer than 1% of notebooks offer.
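
One way to reduce the ordering effect described above is to run an untimed warm-up pass, rotate which candidate goes first in each round, and keep the best time per candidate. The following is only a sketch under those assumptions, not the test program used for the numbers above. The names `Candidate`, `time_once`, `benchmark` and `make_candidates` are hypothetical, and the two candidates are portable stand-ins (`std::copy` and `std::memcpy`); an SSE or AVX loop could be added as another lambda in the same way.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <functional>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; none of them come from the answer above.
struct Candidate {
    std::string name;
    std::function<void()> run;  // performs one copy of the whole buffer
    double best_ns;             // best observed time per call, in nanoseconds

    Candidate(std::string n, std::function<void()> f)
        : name(std::move(n)), run(std::move(f)),
          best_ns(std::numeric_limits<double>::max()) {}
};

// Time `inner` back-to-back calls of fn; return nanoseconds per call.
// steady_clock is used because it is monotonic.
inline double time_once(const std::function<void()>& fn, int inner) {
    assert(inner > 0);
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < inner; ++i) fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / inner;
}

// Warm up once (untimed), then run `rounds` rounds. Each round starts at a
// different candidate, so no candidate always runs first or always runs last.
inline void benchmark(std::vector<Candidate>& cs, int rounds, int inner) {
    if (cs.empty()) return;
    for (auto& c : cs) time_once(c.run, inner);  // result discarded: warm-up
    const std::size_t n = cs.size();
    for (int r = 0; r < rounds; ++r) {
        for (std::size_t k = 0; k < n; ++k) {
            Candidate& c = cs[(static_cast<std::size_t>(r) + k) % n];
            c.best_ns = std::min(c.best_ns, time_once(c.run, inner));
        }
    }
}

// Two portable candidates over the same buffers. The lambdas keep references
// to src and dst, so both vectors must outlive the returned candidates.
inline std::vector<Candidate> make_candidates(const std::vector<float>& src,
                                              std::vector<float>& dst) {
    return {
        {"std::copy", [&] { std::copy(src.begin(), src.end(), dst.begin()); }},
        {"memcpy", [&] {
             std::memcpy(dst.data(), src.data(), src.size() * sizeof(float));
         }},
    };
}
```

To use it, fill a source vector, call `make_candidates(src, dst)`, then `benchmark(cs, rounds, inner)`, and compare the `best_ns` fields. Rotating the order does not remove frequency ramp-up or power-saving effects; it only spreads them across candidates, so results still need to be reported together with the machine setup.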