Tags: c, gcc, compiler-optimization, avx

gcc: Optimize single function with `-mavx -mprefer-avx128`


I want to compile a single function as if with -mavx -mprefer-avx128. None of the other code should use AVX; only this one function should, and it should use 128-bit AVX.

I tried these things:

__attribute__((target("avx")))
void f() { ... }

=> seems to use avx2

__attribute__((target("prefer-avx128")))
void f() { ... }

=> does not compile

__attribute__((target("avx")))
__attribute__((optimize("prefer-avx128")))
void f() { ... }

=> does not compile

Maybe someone knows how this can be done?


Solution

  • -mprefer-avx128, and its modern replacement -mprefer-vector-width=128, are -m options, not -f, so they can only possibly work with target("string") rather than optimize("string") attributes.

    But actually only some -m options work as attributes; the GCC manual's list of x86 target attributes is mostly ISA extensions, arch= and tune=, but also includes prefer-vector-width=OPT. There isn't one based on the older option -mprefer-avx128; probably support for an attribute was added after -mprefer-avx128 was obsoleted in favour of the -mprefer-vector-width option.

    __attribute__((target("avx,prefer-vector-width=128")))
    

    That enables AVX (AVX1 only, not AVX2), and tunes for 128-bit auto-vectorization. Since integer code is easier to auto-vectorize, I actually tested with AVX2:

    __attribute__((target("avx2,prefer-vector-width=128")))
    unsigned foo(unsigned *arr){
        unsigned sum=0;
        for(int i=0 ; i<10240; i++) {
            sum += arr[i];
        }
        return sum;
    }
    
    __attribute__((target("avx2")))
    unsigned bar(unsigned *arr){
        unsigned sum=0;
        for(int i=0 ; i<10240; i++) {
            sum += arr[i];
        }
        return sum;
    }
    

    Compiled with gcc -O3 -mtune=haswell (Godbolt), the first version uses vpaddd xmm and the second uses vpaddd ymm. (-mtune=haswell sets the default vector-width preference to 256 bits.)


    Terminology: AVX1 supports 256-bit vector width for FP operations like vaddps.
    AVX2 adds 256-bit integer operations like vpaddb ymm, and lane-crossing shuffles with granularity finer than 128 bits, like vpermps / vpermq.

    __attribute__((target("avx"))) will definitely not use AVX2 instructions unless you already enabled them on the command line or with an earlier #pragma GCC target.
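
For the original goal (only one function compiled with AVX, tuned for 128-bit vectors), a minimal sketch could look like the following. The names scale_baseline, scale_avx128 and scale are hypothetical. The runtime check with __builtin_cpu_supports is an assumption added here so that the AVX version is only called on CPUs that support it; it is not part of the question. The attribute string also needs a GCC recent enough to accept prefer-vector-width.

```c
#include <stddef.h>

/* Baseline version: uses only the ISA extensions enabled on the command line. */
static void scale_baseline(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Only this function may use AVX. prefer-vector-width=128 asks the
 * auto-vectorizer to prefer 128-bit (xmm) vectors over 256-bit (ymm). */
__attribute__((target("avx,prefer-vector-width=128")))
static void scale_avx128(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}
#endif

/* Hypothetical dispatcher (an assumption, not from the question):
 * call the AVX version only if the running CPU reports AVX support. */
void scale(float *dst, const float *src, float k, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx")) {
        scale_avx128(dst, src, k, n);
        return;
    }
#endif
    scale_baseline(dst, src, k, n);
}
```

Compiling this with gcc -O3 -S and reading the assembly for scale_avx128 should show whether the loop uses xmm or ymm registers; the rest of the file is compiled without AVX unless the command line enables it.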