I have a large uint8_t
array (size = 1824 * 942). I want to do the same operation to each element. Particularly I need to subtract -15 from each element.
This array is refreshed 20 times per second, so time is an issue and I'm avoiding loops over the array.
Is there an easy way to do this?
You can just write a function with a plain loop:
void add(uint8_t* a, size_t a_len, uint8_t b) {
for(uint8_t* ae = a + a_len; a < ae; ++a)
*a += b;
}
And hope that the compiler vectorizes that for you, which it does, see assembly.
Solutions with std::for_each
and std::transform
such as:
void add(uint8_t* a, size_t a_len, uint8_t b) {
std::transform(a, a + a_len, a, [b](auto value) { return value + b; });
}
Should generate exactly the same code, but sometimes they don't.
[Updated]
Out of curiosity, I benchmarked the following solutions:
#include <benchmark/benchmark.h>
#include <cstdint>
#include <array>
#include <algorithm>
#include <immintrin.h>
constexpr size_t SIZE = 1824 * 942;
alignas(32) std::array<uint8_t, SIZE> A;
__attribute__((noinline)) void add_loop(uint8_t* a, size_t a_len, uint8_t b) {
for(uint8_t* ae = a + a_len; a < ae; ++a)
*a += b;
}
__attribute__((noinline)) void add_loop_4way(uint8_t* a, size_t a_len, uint8_t b) {
a_len /= 4;
for(uint8_t* ae = a + a_len; a < ae; ++a) {
a[a_len * 0] += b;
a[a_len * 1] += b;
a[a_len * 2] += b;
a[a_len * 3] += b;
}
}
__attribute__((noinline)) void add_transform(uint8_t* a, size_t a_len, uint8_t b) {
std::transform(a, a + a_len, a, [b](auto value) { return value + b; });
}
inline void add_sse_(__m128i* sse_a, size_t a_len, uint8_t b) {
__m128i sse_b = _mm_set1_epi8(b);
for(__m128i* ae = sse_a + a_len / (sizeof *sse_a / sizeof b); sse_a < ae; ++sse_a)
*sse_a = _mm_add_epi8(*sse_a, sse_b);
}
__attribute__((noinline)) void add_sse(uint8_t* a, size_t a_len, uint8_t b) {
add_sse_(reinterpret_cast<__m128i*>(a), a_len, b);
}
inline void add_avx_(__m256i* avx_a, size_t a_len, uint8_t b) {
__m256i avx_b = _mm256_set1_epi8(b);
for(__m256i* ae = avx_a + a_len / (sizeof *avx_a / sizeof b); avx_a < ae; ++avx_a)
*avx_a = _mm256_add_epi8(*avx_a, avx_b);
}
__attribute__((noinline)) void add_avx(uint8_t* a, size_t a_len, uint8_t b) {
add_avx_(reinterpret_cast<__m256i*>(a), a_len, b);
}
template<decltype(&add_loop) F>
void B(benchmark::State& state) {
for(auto _ : state)
F(A.data(), A.size(), 15);
}
BENCHMARK_TEMPLATE(B, add_loop);
BENCHMARK_TEMPLATE(B, add_loop_4way);
BENCHMARK_TEMPLATE(B, add_transform);
BENCHMARK_TEMPLATE(B, add_sse);
BENCHMARK_TEMPLATE(B, add_avx);
BENCHMARK_MAIN();
Results on i7-7700k CPU and g++-8.3 -DNDEBUG -O3 -march=native -mtune=native
:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
B<add_loop> 31589 ns 31589 ns 21981
B<add_loop_4way> 30030 ns 30030 ns 23265
B<add_transform> 31590 ns 31589 ns 22159
B<add_sse> 39993 ns 39992 ns 17403
B<add_avx> 31588 ns 31587 ns 22161
Times for loop, transform and AVX2 versions are pretty much identical.
SSE version is slower because the compiler generates faster AVX2 code.
perf report
reports ~50% L1d-cache miss rate which indicates that the algorithm is bottlenecked by memory access. Modern CPUs can handle multiple memory accesses simultaneously, so that you can squeeze an extra ~5% of performance here by accessing 4 regions of memory in parallel, which is what the 4-way loop does (for your particular array size 4 ways is the fastest). See Memory-level parallelism: Intel Skylake versus Intel Cannonlake for more details.