I'm in the process of planning a vector math library. It should be based on expression templates that express chained math operations on vectors, e.g.
Vector b = foo (bar (a));
where
- a is a source vector
- bar returns an instance of an expression template that supplies functions that transform each element of a
- foo returns an instance of an expression template that supplies functions that transform each element of the expression fed into it (bar (a))
- Vector::operator= actually invokes the expression returned by foo in a for-loop and stores the result to b
Well, that's the basic example of an expression template so far.
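For reference, the stripped-down scalar version of this pattern that I have in mind looks roughly like this (foo and bar just apply made-up per-element operations here, and everything apart from the names used in the example above is a simplified placeholder):
#include <cstddef>
#include <vector>

// Destination/source vector with plain scalar element access
struct Vector
{
    std::vector<float> data;

    float operator[] (std::size_t i) const  { return data[i]; }
    std::size_t size() const                { return data.size(); }

    // operator= pulls the whole expression chain through a scalar loop
    template <class Expression>
    Vector& operator= (const Expression& e)
    {
        data.resize (e.size());
        for (std::size_t i = 0; i < e.size(); ++i)
            data[i] = e[i];                 // foo (bar (a)) is fully inlined here
        return *this;
    }
};

// Expression returned by bar(): applies a per-element operation to its source
template <class Src>
struct BarExpr
{
    const Src& src;
    float operator[] (std::size_t i) const  { return src[i] * 2.0f; }  // made-up op
    std::size_t size() const                { return src.size(); }
};

// Expression returned by foo(): applies a per-element operation to bar (a)
template <class Src>
struct FooExpr
{
    Src src;
    float operator[] (std::size_t i) const  { return src[i] + 1.0f; }  // made-up op
    std::size_t size() const                { return src.size(); }
};

template <class Src> BarExpr<Src> bar (const Src& s) { return { s }; }
template <class Src> FooExpr<Src> foo (Src s)        { return { s }; }

// Usage:  Vector a, b;  ...  b = foo (bar (a));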
However, the plan is that the templates should not only supply the implementation for a per-element transformation in an operator[] fashion, but also the implementation for a vectorised version of it. Ideally, the template returned by foo supplies a fully inlined function that loads N consecutive elements of a into a SIMD register, executes the chained operations of bar and foo using SIMD instructions, and returns a SIMD register as its result.
This should not be too hard to implement either.
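Just to illustrate what I'm aiming for: after inlining, the 128-bit SSE evaluation of foo (bar (a)) should essentially collapse to something like this (raw SSE intrinsics instead of my SIMDRegister wrapper, and the two per-element operations are made up for the example):
#include <cstddef>
#include <xmmintrin.h>  // SSE

// What the fully inlined 128-bit evaluation of foo (bar (a)) could boil down to:
// load four consecutive floats of a, apply bar's op, apply foo's op, return the register.
inline __m128 eval128_foo_bar (const float* a, std::size_t i)
{
    __m128 v = _mm_loadu_ps (a + i);         // load 4 consecutive elements of a
    v = _mm_mul_ps (v, _mm_set1_ps (2.0f));  // bar: example per-element operation
    v = _mm_add_ps (v, _mm_set1_ps (1.0f));  // foo: example per-element operation
    return v;                                // the caller stores this back into b
}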
Now, on x86_64 CPUs, I'd love to optionally use AVX if available, but I want the implementation to invoke the AVX based code path only if the CPU we currently run on supports AVX. Otherwise an SSE based implementation should be used, or a non-SIMD fallback as a last resort. Usually in cases like this, the various implementations would be put into different translation units, each compiled with the corresponding vector instruction flags, and the runtime code would contain some dispatching logic.
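Such a classic setup might look roughly like this (the function names, the file split and the CPU-detection helpers are just placeholders for illustration):
// kernels.h: one interface, three separately compiled implementations
#include <cstddef>

void addVectors_avx   (float* dst, const float* a, const float* b, std::size_t n);  // TU compiled with -mavx / /arch:AVX
void addVectors_sse   (float* dst, const float* a, const float* b, std::size_t n);  // TU compiled with SSE only
void addVectors_plain (float* dst, const float* a, const float* b, std::size_t n);  // TU compiled without SIMD flags

bool cpuSupportsAvx();  // placeholders for some CPUID-based feature detection
bool cpuSupportsSse();

// dispatch.cpp: picks an implementation at runtime
void addVectors (float* dst, const float* a, const float* b, std::size_t n)
{
    if (cpuSupportsAvx())
        addVectors_avx (dst, a, b, n);
    else if (cpuSupportsSse())
        addVectors_sse (dst, a, b, n);
    else
        addVectors_plain (dst, a, b, n);
}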
However, with the expression template approach, the chain of instructions to be executed is defined in the application code, which invokes some chained expression templates that translate to a very specific template instantiation. I explicitly want the compiler to be able to fully inline the chain of expressions to gain maximum performance, so the whole implementation should be header-only. But compiling the whole user code with e.g. an AVX flag would be a bad idea, since the compiler would probably emit AVX instructions in other parts of the code as well, which might then not be supported at runtime.
So, is there any clever solution for that? Basically, I'm looking for a way to force the compiler to only generate certain SIMD instructions within, e.g., the scope of a function template, as in this piece of pseudocode (I know that the code in Vector::operator= does not reflect alignment etc., and you have to imagine that I have some fancy C++ wrapper for the native SIMD registers; I just want to point out the core question here):
template <class T, class Src>
struct MyFancyExpressionTemplate
{
    // Compile this function with SSE
    forcedinline SIMDRegister<T, 128> eval128 (size_t i)
    {
        // SSE implementation here
    }

    // Compile this function with AVX
    forcedinline SIMDRegister<T, 256> eval256 (size_t i)
    {
        // AVX implementation here
    }
};

template <class T>
class Vector
{
public:
    // A whole lot of functions

    template <class Expression>
    requires isExpressionTemplate<Expression>
    void operator= (Expression&& e)
    {
        const auto n = size();

        if (avxIsAvailable)
        {
            // Compile this part with AVX
            for (size_t i = 0; i < n; i += SIMDRegister<T, 256>::numElements)
                e.eval256 (i).store (mem + i);

            return;
        }

        if (sseIsAvailable)
        {
            // Compile this part with SSE
            for (size_t i = 0; i < n; i += SIMDRegister<T, 128>::numElements)
                e.eval128 (i).store (mem + i);

            return;
        }

        // Scalar fallback
        for (size_t i = 0; i < n; ++i)
            mem[i] = e[i];
    }

    // A whole lot of further functions
};
To my knowledge this is not possible, but I might be overlooking some fancy #pragma or some trick to reorganize my code to make it work. Ideas for completely different approaches that still give the compiler room to inline the whole chain of SIMD operations are also greatly appreciated.
We are targeting (Apple) Clang 13+ and MSVC 2022, but are thinking of switching to Clang for Windows as well. We use C++20.
To have a proper answer, as I already noted in the comments: on clang (and gcc) you can use multiversioning to achieve this, i.e. function attributes such as __attribute__((target("avx"))). So it can look like this:
template <class T, class Src>
struct MyFancyExpressionTemplate
{
    // Compile this function with SSE
    __attribute__((target("sse2")))
    SIMDRegister<T, 128> eval128 (size_t i)
    {
        // SSE implementation here
    }

    // Compile this function with AVX
    __attribute__((target("avx")))
    SIMDRegister<T, 256> eval256 (size_t i)
    {
        // AVX implementation here
    }
};
But note that this will prevent the compiler from inlining the functions into other functions that do not have the same target options. So the functions containing the loops that call these functions should probably also have these targets set.
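As a rough sketch of what that could look like (reusing the hypothetical SIMDRegister wrapper and the eval128/eval256 interface from your pseudocode, and clang/gcc's __builtin_cpu_supports builtin for the runtime check), you could pull the loops into small target-attributed helpers and dispatch between them:
#include <cstddef>

// The loops carry the same target attribute as the eval functions they call,
// so the whole expression chain can be inlined into them.
template <class T, class Expression>
__attribute__((target("avx")))
void evaluateAvx (T* mem, std::size_t n, Expression& e)
{
    for (std::size_t i = 0; i < n; i += SIMDRegister<T, 256>::numElements)
        e.eval256 (i).store (mem + i);
}

template <class T, class Expression>
__attribute__((target("sse2")))
void evaluateSse (T* mem, std::size_t n, Expression& e)
{
    for (std::size_t i = 0; i < n; i += SIMDRegister<T, 128>::numElements)
        e.eval128 (i).store (mem + i);
}

// Called from Vector::operator=; the runtime check picks the widest supported path.
template <class T, class Expression>
void evaluate (T* mem, std::size_t n, Expression& e)
{
    if (__builtin_cpu_supports ("avx"))
        evaluateAvx (mem, n, e);
    else if (__builtin_cpu_supports ("sse2"))
        evaluateSse (mem, n, e);
    else
        for (std::size_t i = 0; i < n; ++i)  // scalar fallback
            mem[i] = e[i];
}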
On MSVC, something similar is neither available nor required. The /arch
compiler flag just tells the compiler that it may use the corresponding instruction set everywhere. You can use AVX intrinsics even without enabling /arch:AVX; you are then responsible for ensuring that the machine actually supports AVX when it executes that code.
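For the runtime check itself, the usual pattern on MSVC looks something like the following sketch (using the __cpuid and _xgetbv intrinsics; note that you have to check the OSXSAVE bit and the XCR0 register in addition to the AVX bit, because the OS also has to save the YMM state):
#include <intrin.h>

// Returns true if both the CPU and the OS support AVX.
bool avxIsAvailable()
{
    int info[4] = {};
    __cpuid (info, 1);

    const bool osxsave = (info[2] & (1 << 27)) != 0;  // OS uses XSAVE/XRSTOR
    const bool avx     = (info[2] & (1 << 28)) != 0;  // CPU implements AVX

    if (! (osxsave && avx))
        return false;

    // XCR0 bits 1 and 2: the OS saves XMM and YMM state on context switches
    const unsigned long long xcr0 = _xgetbv (_XCR_XFEATURE_ENABLED_MASK);
    return (xcr0 & 0x6) == 0x6;
}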
There are other posts on Stack Overflow with more information, so I suggest you have a look at those, for example this, this, or this post.