Tags: c++, arrays, math, pointer-arithmetic, integer-arithmetic

Subtract or add a constant to a large array


I have a large uint8_t array (size = 1824 * 942). I want to apply the same operation to each element; in particular, I need to subtract -15 from each element.

This array is refreshed 20 times per second, so time is an issue and I'm trying to avoid looping over the array.

Is there an easy way to do this?


Solution

  • You can just write a function with a plain loop:

    void add(uint8_t* a, size_t a_len, uint8_t b) {
        for(uint8_t* ae = a + a_len; a < ae; ++a)
            *a += b;
    }
    

    And hope that the compiler vectorizes it for you, which it does (see the generated assembly).
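
    Either direction works with the same function, since uint8_t arithmetic wraps modulo 256: pass the constant to add it, or its negation cast back to uint8_t to subtract it. A minimal usage sketch (the buffer name is a placeholder; the dimensions are the ones from the question):

    std::vector<uint8_t> frame(1824 * 942);        // hypothetical frame buffer

    add(frame.data(), frame.size(), 15);           // add 15 to every element
    add(frame.data(), frame.size(), uint8_t(-15)); // subtract 15: adds 241 == -15 mod 256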

    Solutions with std::for_each and std::transform, such as:

    void add(uint8_t* a, size_t a_len, uint8_t b) {
        std::transform(a, a + a_len, a, [b](auto value) { return value + b; });
    }
    

    These should generate exactly the same code, but sometimes they don't; a std::for_each version is sketched below.
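
    For reference, the std::for_each variant of the same idea might look like this (a sketch, modifying each element in place through the reference; it needs <algorithm> and <cstdint> like the snippets above):

    void add(uint8_t* a, size_t a_len, uint8_t b) {
        // std::for_each applies the callable to every element; taking the
        // parameter by reference lets it update the array in place.
        std::for_each(a, a + a_len, [b](uint8_t& value) { value += b; });
    }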


    [Updated]

    Out of curiosity, I benchmarked the following solutions:

    #include <benchmark/benchmark.h>
    
    #include <cstdint>
    #include <array>
    #include <algorithm>
    
    #include <immintrin.h>
    
    constexpr size_t SIZE = 1824 * 942;
    // 32-byte aligned so the AVX2 version can use aligned loads/stores.
    alignas(32) std::array<uint8_t, SIZE> A;
    
    __attribute__((noinline)) void add_loop(uint8_t* a, size_t a_len, uint8_t b) {
        for(uint8_t* ae = a + a_len; a < ae; ++a)
            *a += b;
    }
    
    // Walk four interleaved quarters of the array at once, so that four
    // independent memory streams are in flight (memory-level parallelism).
    __attribute__((noinline)) void add_loop_4way(uint8_t* a, size_t a_len, uint8_t b) {
        a_len /= 4;
        for(uint8_t* ae = a + a_len; a < ae; ++a) {
            a[a_len * 0] += b;
            a[a_len * 1] += b;
            a[a_len * 2] += b;
            a[a_len * 3] += b;
        }
    }
    
    __attribute__((noinline)) void add_transform(uint8_t* a, size_t a_len, uint8_t b) {
        std::transform(a, a + a_len, a, [b](auto value) { return value + b; });
    }
    
    // SSE2: add b to 16 bytes per iteration.
    inline void add_sse_(__m128i* sse_a, size_t a_len, uint8_t b) {
        __m128i sse_b = _mm_set1_epi8(b);
        for(__m128i* ae = sse_a + a_len / (sizeof *sse_a / sizeof b); sse_a < ae; ++sse_a)
            *sse_a = _mm_add_epi8(*sse_a, sse_b);
    }
    
    __attribute__((noinline)) void add_sse(uint8_t* a, size_t a_len, uint8_t b) {
        add_sse_(reinterpret_cast<__m128i*>(a), a_len, b);
    }
    
    // AVX2: add b to 32 bytes per iteration.
    inline void add_avx_(__m256i* avx_a, size_t a_len, uint8_t b) {
        __m256i avx_b = _mm256_set1_epi8(b);
        for(__m256i* ae = avx_a + a_len / (sizeof *avx_a / sizeof b); avx_a < ae; ++avx_a)
            *avx_a = _mm256_add_epi8(*avx_a, avx_b);
    }
    
    __attribute__((noinline)) void add_avx(uint8_t* a, size_t a_len, uint8_t b) {
        add_avx_(reinterpret_cast<__m256i*>(a), a_len, b);
    }
    
    // Benchmark harness: run the given add function over the whole array.
    template<decltype(&add_loop) F>
    void B(benchmark::State& state) {
        for(auto _ : state)
            F(A.data(), A.size(), 15);
    }
    
    BENCHMARK_TEMPLATE(B, add_loop);
    BENCHMARK_TEMPLATE(B, add_loop_4way);
    BENCHMARK_TEMPLATE(B, add_transform);
    BENCHMARK_TEMPLATE(B, add_sse);
    BENCHMARK_TEMPLATE(B, add_avx);
    
    BENCHMARK_MAIN();
    

    Results on an i7-7700K CPU, compiled with g++-8.3 -DNDEBUG -O3 -march=native -mtune=native:

    ------------------------------------------------------------------
    Benchmark                        Time             CPU   Iterations
    ------------------------------------------------------------------
    B<add_loop>                  31589 ns        31589 ns        21981
    B<add_loop_4way>             30030 ns        30030 ns        23265
    B<add_transform>             31590 ns        31589 ns        22159
    B<add_sse>                   39993 ns        39992 ns        17403
    B<add_avx>                   31588 ns        31587 ns        22161
    

    Times for the loop, transform and AVX2 versions are pretty much identical.

    The SSE version is slower because it processes only 16 bytes per iteration, while the compiler auto-vectorizes the other versions with AVX2, processing 32 bytes per iteration.

    perf reports an ~50% L1d-cache miss rate, which indicates that the algorithm is bottlenecked by memory access. Modern CPUs can keep multiple memory accesses in flight at once, so you can squeeze an extra ~5% of performance out of it by walking 4 regions of memory in parallel, which is what the 4-way loop does (for your particular array size, 4 ways is the fastest). See Memory-level parallelism: Intel Skylake versus Intel Cannonlake for more details.