I have the GCC 9.2 compiler. If I use the MMX or SSE/AVX extensions, the code will run in parallel, so it will be faster. How do I tell the compiler to use these instructions? I have a code snippet I want to parallelize:
char max(char * a, int n){
    char max = (*a);
    for (int i = 0; i < n; ++i){
        if (max < a[i]){
            max = a[i];
        }
    }
    return max;
}
It generates code using the SSE extensions, but it doesn't use pmaxub. Why?
SSE2 is baseline for x86-64, so yes, pmaxub is available.
But your code uses char, and char = signed char in the x86-64 System V ABI, and in Windows x64. Perhaps you're coming from ARM, where char = unsigned char? The ISO C standard leaves the signedness of char implementation-defined, so it's a terrible idea to rely on it for correctness (or performance, in this case).
If you use uint8_t like a normal person, you get the expected inner loop from GCC 9.2 -O3 for x86-64, even without using -march=skylake or anything to enable AVX2. (Godbolt)
.L14:
movdqu xmm2, XMMWORD PTR [rax]
add rax, 16
pmaxub xmm0, xmm2
cmp rax, rdx
jne .L14
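For reference, the unsigned version that produces this pmaxub loop could look like the following minimal sketch (max_u8 is a made-up name for illustration):

```c
#include <stdint.h>
#include <stddef.h>

// Same scalar max loop, but over uint8_t: the comparison is now
// unsigned, so GCC's auto-vectorizer can use pmaxub directly.
uint8_t max_u8(const uint8_t *a, size_t n){
    uint8_t max = a[0];
    for (size_t i = 0; i < n; ++i)
        if (max < a[i])
            max = a[i];
    return max;
}
```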
pmaxsb requires SSE4.1. (SSE2 is highly non-orthogonal, like MMX, with some operations only available for some combinations of size and signedness, targeting specific applications like audio DSP and graphics pixels. SSE4.1 filled in many of the gaps.) If you enable it, GCC and clang use it.
With just -O3 and the baseline x86-64 -march default (and -mtune=generic), GCC auto-vectorizes with pcmpgtb (which is a signed compare), then does a manual blend using pand/pandn/por, plus the requisite extra movdqa copying that entails. pcmpgtb is your hint that your code as written needs a signed compare, not an unsigned one. Clang does the same thing.
.L5:
movdqu xmm1, XMMWORD PTR [rax]
add rax, 16
movdqa xmm2, xmm1
pcmpgtb xmm2, xmm0
pand xmm1, xmm2
pandn xmm2, xmm0
movdqa xmm0, xmm2
por xmm0, xmm1
cmp rax, rdx
jne .L5
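That compare-and-blend sequence corresponds roughly to the following SSE2 intrinsics (a hand-written sketch for illustration, not actual compiler output; blend_max_i8 is a made-up name):

```c
#include <emmintrin.h>  // SSE2

// One step of the auto-vectorized loop: signed compare, then a
// manual blend with AND/ANDNOT/OR to pick the larger bytes.
__m128i blend_max_i8(__m128i cur_max, __m128i v){
    __m128i gt   = _mm_cmpgt_epi8(v, cur_max);     // pcmpgtb: signed v > cur_max, per byte
    __m128i take = _mm_and_si128(v, gt);           // pand: bytes of v where v was greater
    __m128i keep = _mm_andnot_si128(gt, cur_max);  // pandn: bytes of cur_max elsewhere
    return _mm_or_si128(take, keep);               // por: merge the two halves
}
```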
GCC could have auto-vectorized by range-shifting the inputs to unsigned for pmaxub, then range-shifting back to signed outside the loop, by adding/subtracting 128 (i.e. pxor with _mm_set1_epi8(0x80)). So this is a big missed optimization for this case, which could have kept the critical-path latency down to 1 cycle, just pmaxub.
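A hand-written version of that range-shift trick could be sketched with SSE2 intrinsics like this (assuming, for brevity, that n is a nonzero multiple of 16; max_i8_rangeshift is a made-up name):

```c
#include <emmintrin.h>  // SSE2
#include <stdint.h>
#include <stddef.h>

// XOR with 0x80 maps signed bytes onto the same order as unsigned
// bytes, so pmaxub (_mm_max_epu8) computes a signed max in the
// flipped domain. Flip back once at the end.
int8_t max_i8_rangeshift(const int8_t *a, size_t n){
    const __m128i flip = _mm_set1_epi8((char)0x80);
    __m128i vmax = _mm_xor_si128(_mm_loadu_si128((const __m128i*)a), flip);
    for (size_t i = 16; i < n; i += 16){
        __m128i v = _mm_xor_si128(_mm_loadu_si128((const __m128i*)(a + i)), flip);
        vmax = _mm_max_epu8(vmax, v);  // unsigned max == signed max after flip
    }
    // Horizontal reduction of 16 bytes down to byte 0, still flipped.
    vmax = _mm_max_epu8(vmax, _mm_srli_si128(vmax, 8));
    vmax = _mm_max_epu8(vmax, _mm_srli_si128(vmax, 4));
    vmax = _mm_max_epu8(vmax, _mm_srli_si128(vmax, 2));
    vmax = _mm_max_epu8(vmax, _mm_srli_si128(vmax, 1));
    return (int8_t)((uint8_t)_mm_cvtsi128_si32(vmax) ^ 0x80);
}
```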
But of course, if you actually enable SSE4.1, you get pmaxsb. Or, with AVX2, vpmaxsb.
You could use -msse4.1 or -mavx2, but usually you want to enable the other extensions that more recent CPUs have, too, and set tuning options. Especially for AVX2: you don't want to tune for Sandybridge and older CPUs, because SnB doesn't even have AVX2, and you don't want split unaligned loads and stuff like that. Also, AVX2 CPUs normally have BMI2, popcnt, and other goodies as well.
Use -march=haswell or -march=znver1 (Zen). Or, for local use, -march=native to optimize for your CPU. (That's identical to using -march=skylake if you have a Skylake, unless maybe it detects your specific L3 cache size or something.)