Tags: gcc, assembly, compiler-optimization, sse, auto-vectorization

MMX/SSE extensions for a for loop


I have the GCC 9.2 compiler. If I use the MMX or SSE/AVX extensions, the code can run in parallel (SIMD), so it will be faster. How do I tell the compiler to use these instructions? Here is the code snippet I want to parallelize:

char max(char *a, int n) {
    char max = (*a);
    for (int i = 0; i < n; ++i) {
        if (max < a[i]) {
            max = a[i];
        }
    }
    return max;
}

It generates code using SSE instructions, but it doesn't use pmaxub. Why?


Solution

  • SSE2 is baseline for x86-64, so yes pmaxub is available.

    But your code uses char, and char = signed char in the x86-64 System V ABI and in Windows x64. Perhaps you're coming from ARM, where char = unsigned char? The ISO C standard leaves the signedness of char implementation-defined, so it's a terrible idea to rely on it for correctness (or performance, in this case).

    If you use uint8_t like a normal person, you get the expected inner loop from GCC 9.2 -O3 for x86-64, even without using -march=skylake or anything else to enable AVX2. (Godbolt)

    .L14:
            movdqu  xmm2, XMMWORD PTR [rax]
            add     rax, 16
            pmaxub  xmm0, xmm2
            cmp     rax, rdx
            jne     .L14
    

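    For reference, the uint8_t version of the source that produces that loop looks like this (only the element type changes from the question's code; the function name is my own):

    #include <stdint.h>

    // Same reduction, but over unsigned bytes: the comparison is now
    // unsigned, so GCC can map it straight onto pmaxub.
    uint8_t max_u8(const uint8_t *a, int n) {
        uint8_t max = a[0];
        for (int i = 0; i < n; ++i) {
            if (max < a[i]) {
                max = a[i];
            }
        }
        return max;
    }
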
    pmaxsb requires SSE4.1. (SSE2 is highly non-orthogonal like MMX, with some operations only available for some combinations of size and signedness, targeting specific applications like audio DSP and graphics pixels. SSE4.1 filled in many of the gaps.)

    If you enable it, GCC and clang use it.


    With just -O3 and the baseline x86-64 -march default (and -mtune=generic), GCC auto-vectorizes with pcmpgtb (which is a signed compare) and then a manual blend using pand/pandn/por, plus the requisite extra movdqa copying that entails. pcmpgtb is your hint that your code as written needs a signed compare, not an unsigned one. Clang does the same thing.

    .L5:
            movdqu  xmm1, XMMWORD PTR [rax]
            add     rax, 16
            movdqa  xmm2, xmm1
            pcmpgtb xmm2, xmm0
            pand    xmm1, xmm2
            pandn   xmm2, xmm0
            movdqa  xmm0, xmm2
            por     xmm0, xmm1
            cmp     rax, rdx
            jne     .L5
    

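    Written with intrinsics, that compare-and-blend pattern is just the following (a sketch of what the generated code computes, not the compiler's own source; the function name is mine):

    #include <emmintrin.h>   // SSE2

    // One step of the SSE2 manual blend: select a where a > cur (signed
    // compare), keep cur elsewhere. This is what the pcmpgtb / pand /
    // pandn / por sequence above computes.
    __m128i max_step_sse2(__m128i cur, __m128i a) {
        __m128i gt = _mm_cmpgt_epi8(a, cur);             // 0xFF where a > cur
        return _mm_or_si128(_mm_and_si128(a, gt),        // take a where gt set
                            _mm_andnot_si128(gt, cur));  // take cur elsewhere
    }
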
    GCC could have auto-vectorized this by range-shifting the inputs to unsigned for pmaxub, then range-shifting back to signed outside the loop, by adding/subtracting 128 (i.e. pxor with _mm_set1_epi8(0x80)). So this is a big missed optimization for this case, one that could have kept the critical-path latency down to 1 cycle: just pmaxub.
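    A manual-intrinsics sketch of that range-shift trick could look like the following (my own code, not GCC's output; the function name and the scalar cleanup loops are assumptions, and it relies on plain char being signed as on x86-64):

    #include <emmintrin.h>   // SSE2
    #include <stdint.h>

    char max_s8_rangeshift(const char *a, int n) {
        const __m128i flip = _mm_set1_epi8((char)0x80);  // toggles the sign bit
        __m128i vmax = _mm_setzero_si128();   // 0 in shifted space == CHAR_MIN
        int i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
            // Range-shift to unsigned, then take the unsigned max (pmaxub).
            vmax = _mm_max_epu8(vmax, _mm_xor_si128(v, flip));
        }
        // Reduce the vector, shifting each lane back to signed.
        uint8_t tmp[16];
        _mm_storeu_si128((__m128i *)tmp, vmax);
        char max = (char)(tmp[0] ^ 0x80);
        for (int j = 1; j < 16; ++j) {
            char c = (char)(tmp[j] ^ 0x80);
            if (max < c) max = c;
        }
        // Scalar tail for the last n % 16 elements.
        for (; i < n; ++i) {
            if (max < a[i]) max = a[i];
        }
        return max;
    }
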


    But of course if you actually enable SSE4.1, you get pmaxsb. Or AVX2 vpmaxsb.
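    At the intrinsics level that's _mm_max_epi8 (a sketch for reference; it requires compiling with -msse4.1 or a -march that includes it):

    #include <smmintrin.h>   // SSE4.1 intrinsics

    // Signed per-byte max in one instruction: compiles to pmaxsb,
    // or vpmaxsb when AVX is enabled.
    __m128i max_step_sse41(__m128i cur, __m128i a) {
        return _mm_max_epi8(cur, a);
    }
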

    You could use -msse4.1 or -mavx2, but usually you want to enable other extensions that more recent CPUs have, too, and set tuning settings. Especially for AVX2, you don't want to tune for Sandybridge and older CPUs because SnB doesn't even have AVX2: you don't want the generic-tuning workaround of splitting unaligned 256-bit loads, and stuff like that. Also, AVX2 CPUs normally also have BMI2, popcnt, and other goodies.

    Use -march=haswell or -march=znver1 (Zen). Or for local use, -march=native to optimize for your CPU. (It's identical to using -march=skylake if you have a Skylake, unless maybe it detects your specific L3 cache size or something.)
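    For example (assuming the code lives in a file called max.c; the file name is just for illustration):

    gcc -O3 -msse4.1 max.c          # baseline x86-64 plus SSE4.1
    gcc -O3 -march=haswell max.c    # AVX2, BMI2, popcnt, etc., with Haswell tuning
    gcc -O3 -march=native max.c     # optimize for the CPU you're compiling on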