gcc: Optimize single function with `-mavx -mprefer-avx128`

I want to compile a single function as if with -mavx -mprefer-avx128. None of the rest of the code should use AVX; only this one function should, and it should stick to 128-bit vectors (AVX128).

I tried these things:

__attribute__((target("avx")))
void f() { ... }

=> seems to use avx2

__attribute__((target("prefer-avx128")))
void f() { ... }

=> does not compile

__attribute__((target("avx")))
__attribute__((optimize("prefer-avx128")))
void f() { ... }

=> does not compile

Maybe someone knows how this can be done?

Solution

-mprefer-avx128, and its modern replacement -mprefer-vector-width=128, are -m options, not -f, so they can only possibly work with target("string") rather than optimize("string") attributes.

But actually only some -m options work as attributes; the GCC manual's list of x86 target attributes is mostly ISA extensions, arch= and tune=, but also includes prefer-vector-width=OPT. There isn't one based on the older option -mprefer-avx128; probably support for an attribute was added after -mprefer-avx128 was obsoleted in favour of the -mprefer-vector-width option.

__attribute__((target("avx,prefer-vector-width=128")))

That enables AVX (AVX1 only, not AVX2), and tunes for 128-bit auto-vectorization. Since integer code is easier to auto-vectorize, I actually tested with AVX2:

__attribute__((target("avx2,prefer-vector-width=128")))
unsigned foo(unsigned *arr){
    unsigned sum=0;
    for(int i=0 ; i<10240; i++) {
        sum += arr[i];
    }
    return sum;
}

__attribute__((target("avx2")))
unsigned bar(unsigned *arr){
    unsigned sum=0;
    for(int i=0 ; i<10240; i++) {
        sum += arr[i];
    }
    return sum;
}

Compiled with gcc -O3 -mtune=haswell (Godbolt), the first version uses vpaddd xmm, the second uses vpaddd ymm. (tune=haswell sets the normal vector-width preference to 256.)

Terminology: AVX1 already supports 256-bit vector width, but only for FP operations like vaddps ymm.
AVX2 adds 256-bit integer operations like vpaddb ymm, plus lane-crossing shuffles with granularity finer than 128-bit like vpermps / vpermq.

__attribute__((target("avx"))) will definitely not use AVX2 instructions unless you already enabled them on the command line or with an earlier #pragma GCC target.
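
The #pragma GCC target form can also scope the same options to a region of a file instead of tagging each function. A minimal sketch, assuming GCC 8 or later on x86: the function names (scale_add_baseline, scale_add_avx128, scale_add) and the runtime check with __builtin_cpu_supports are illustrative additions, on the assumption that the rest of the program must still run on CPUs without AVX.

```c
#include <stddef.h>

/* Baseline version: built with whatever ISA the command line selects. */
static void scale_add_baseline(float *restrict out, const float *restrict a,
                               const float *restrict b, float s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * s + b[i];
}

/* Functions defined between push_options and pop_options are compiled
   with AVX enabled and a preferred auto-vectorization width of 128 bits. */
#pragma GCC push_options
#pragma GCC target("avx,prefer-vector-width=128")
static void scale_add_avx128(float *restrict out, const float *restrict a,
                             const float *restrict b, float s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * s + b[i];
}
#pragma GCC pop_options

/* Hypothetical dispatcher: only call the AVX version when the CPU
   reports AVX support, otherwise fall back to the baseline. */
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float s, size_t n)
{
    if (__builtin_cpu_supports("avx"))
        scale_add_avx128(out, a, b, s, n);
    else
        scale_add_baseline(out, a, b, s, n);
}
```

At -O3 I'd expect the loop in scale_add_avx128 to use VEX-encoded instructions on xmm registers, but check the asm for your GCC version and tuning options rather than taking that for granted. Callers just use scale_add(out, a, b, 2.0f, n) and never need to know which version ran.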