Organizing multiple implementations (for SIMD)

This is admittedly an open-ended/subjective question but I am looking for different ideas on how to "organize" multiple alternative implementations of the same functions.

I have a set of several functions that each have platform-specific implementations. Specifically, they each have a different implementation for a particular SIMD type: NEON (64-bit), NEON (128-bit), SSE3, AVX2, etc. (and one non-SIMD implementation).

All functions have a non-SIMD implementation. Not all functions are specialized for each SIMD type.

Currently, I have one monolithic file that uses a mess of #ifdefs to implement the particular SIMD specializations. It worked when we were only specializing a few of the functions to one or two SIMD types. Now, it's become unwieldy.

Effectively, I need something that functions like a virtual/override. The non-SIMD implementations are implemented in a base class and SIMD specializations (if any) would override them. But I don't want actual runtime polymorphism. This code is performance critical and many of the functions can (and should) be inlined.

Something along these lines would accomplish what I need (which is still a mess of #ifdefs).

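// functions.h

void function1();
void function2();

#ifdef __ARM_NEON
#include "functions_neon64.h"
#elif __SSE3__
#include "functions_sse3.h"
#endif

#include "functions_unoptimized.h"

// functions_neon64.h
// The guard macro tells functions_unoptimized.h that function1 already exists
#ifndef FUNCTION1_IMPL
#define FUNCTION1_IMPL
void function1() {
  // NEON64 implementation
}
#endif

// functions_sse3.h
#ifndef FUNCTION2_IMPL
#define FUNCTION2_IMPL
void function2() {
  // SSE3 implementation
}
#endif

// functions_unoptimized.h
#ifndef FUNCTION1_IMPL
#define FUNCTION1_IMPL
void function1() {
  // Non-SIMD implementation
}
#endif

#ifndef FUNCTION2_IMPL
#define FUNCTION2_IMPL
void function2() {
  // Non-SIMD implementation
}
#endif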

Anyone have any better ideas?


  • The following are just some ideas that I came up with while thinking about it; there might be better solutions that I'm not aware of.

    1. Tag-Dispatch

    Using Tag-Dispatch you can define an order in which the functions should be considered by the compiler, e.g. in this case it's

    AVX2 -> SSE3 -> Neon128 -> Neon64 -> None

    The first implementation that's present in this chain will be used: godbolt example

    /** functions.h *******************/
    #include <iostream>

    struct SIMD_None_t {};
    struct SIMD_Neon64_t : SIMD_None_t {};
    struct SIMD_Neon128_t : SIMD_Neon64_t {};
    struct SIMD_SSE3_t : SIMD_Neon128_t {};
    struct SIMD_AVX2_t : SIMD_SSE3_t {};
    struct SIMD_Any_t : SIMD_AVX2_t {};

    #include "functions_unoptimized.h"
    #ifdef __ARM_NEON
    #include "functions_neon64.h"
    #endif
    #ifdef __SSE3__
    #include "functions_sse3.h"
    #endif
    // etc...
    #include "functions_stubs.h"

    /** functions_unoptimized.h *******/
    inline int add(int a, int b, SIMD_None_t) {
        std::cout << "NONE" << std::endl;
        return a + b;
    }

    /** functions_neon64.h ************/
    inline int add(int a, int b, SIMD_Neon64_t) {
        std::cout << "NEON!" << std::endl;
        return a + b;
    }

    /** functions_neon128.h ***********/
    inline int add(int a, int b, SIMD_Neon128_t) {
        std::cout << "NEON128!" << std::endl;
        return a + b;
    }

    /** functions_stubs.h *************/
    // Dispatch stub: starts at the most-derived tag, so overload resolution
    // picks the best implementation that is actually present
    inline int add(int a, int b) {
        return add(a, b, SIMD_Any_t{});
    }

    /** main.cpp **********************/
    #include "functions.h"
    int main() {
        add(1, 2);
    }

    This would output NEON128!, since that's the best match in this case.
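Since not all of your functions are specialized for every SIMD type, here is a rough, self-contained sketch of how the chain behaves in that case. The `Impl` enum and the `which_*` functions are hypothetical names made up for illustration, and the bodies only report which overload was picked instead of doing real work:

```cpp
// Tag hierarchy from the example above, trimmed to three levels.
struct SIMD_None_t {};
struct SIMD_Neon64_t : SIMD_None_t {};
struct SIMD_Neon128_t : SIMD_Neon64_t {};
struct SIMD_Any_t : SIMD_Neon128_t {};

// Hypothetical marker type: reports which overload was selected.
enum class Impl { None, Neon64, Neon128 };

// "add" has an implementation at every level.
constexpr Impl which_add(SIMD_None_t)    { return Impl::None; }
constexpr Impl which_add(SIMD_Neon64_t)  { return Impl::Neon64; }
constexpr Impl which_add(SIMD_Neon128_t) { return Impl::Neon128; }

// "mul" is only specialized for Neon64.
constexpr Impl which_mul(SIMD_None_t)   { return Impl::None; }
constexpr Impl which_mul(SIMD_Neon64_t) { return Impl::Neon64; }

// "sub" only has the non-SIMD fallback.
constexpr Impl which_sub(SIMD_None_t) { return Impl::None; }

// Dispatch stubs: start at the most-derived tag and let overload resolution
// pick the nearest base class for which an overload exists.
constexpr Impl which_add() { return which_add(SIMD_Any_t{}); }
constexpr Impl which_mul() { return which_mul(SIMD_Any_t{}); }
constexpr Impl which_sub() { return which_sub(SIMD_Any_t{}); }

static_assert(which_add() == Impl::Neon128, "add: Neon128 is the nearest overload");
static_assert(which_mul() == Impl::Neon64,  "mul: skips Neon128, lands on Neon64");
static_assert(which_sub() == Impl::None,    "sub: falls all the way back");
```

This works because converting a derived tag to a closer base class ranks better in overload resolution than converting it to a more distant one, so a missing level is simply skipped.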


    Pros:

    • No #ifdefs needed in the implementation header files
    • Callers don't need to be modified

    Cons:

    • You'll need to add an extra argument to each implementation
    • A dispatch function is required to supply the extra argument
      (You could theoretically get rid of this function by adding , SIMD_Any_t{} everywhere you call the function, but that's a lot of work)

    2. Put the functions into classes and use name lookup to pick the right function


    struct None { inline static int add(int a, int b) { return a + b; } };
    struct Neon64 : None { inline static int add(int a, int b) { return a + b; } };
    struct Neon128 : Neon64 {};
    struct SIMD : Neon128 {};
    // Usage:
    int r = SIMD::add(1, 2);

    Because a derived class hides same-named members of its base classes, this call is not ambiguous: name lookup stops at the most-derived class that declares the function, so the inheritance order defines the priority of your implementations.

    For your example it could look like this: godbolt example

    #include <iostream>

    /** functions.h *******************/
    #include "functions_unoptimized.h"
    #ifdef __ARM_NEON
    #include "functions_neon64.h"
    #else
      // Empty placeholder so the inheritance chain stays intact
      struct SIMD_Neon64 : SIMD_None {};
    #endif
    #ifdef __ARM_NEON_128
    #include "functions_neon128.h"
    #else
      struct SIMD_Neon128 : SIMD_Neon64 {};
    #endif
    // etc...
    struct SIMD : SIMD_Neon128 {};

    /** functions_unoptimized.h *******/
    struct SIMD_None {
        inline static int sub(int a, int b) {
            std::cout << "NONE" << std::endl;
            return a - b;
        }
    };

    /** functions_neon64.h ************/
    struct SIMD_Neon64 : SIMD_None {
        inline static int sub(int a, int b) {
            std::cout << "Neon64" << std::endl;
            return a - b;
        }
    };

    /** functions_neon128.h ***********/
    struct SIMD_Neon128 : SIMD_Neon64 {
        inline static int sub(int a, int b) {
            std::cout << "Neon128" << std::endl;
            return a - b;
        }
    };

    /** main.cpp **********************/
    #include "functions.h"
    int main() {
        SIMD::sub(2, 3);
    }

    This would output Neon128.


    Pros:

    • No #ifdefs needed in the implementation header files
    • No dispatch function required, the compiler will automatically pick the best one
    • No extra function parameters required

    Cons:

    • You need to change all calls to the functions and prefix them with SIMD::
    • You need to wrap all the functions inside structs and use inheritance, so it's a bit involved
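One more thing to watch out for with this approach, in case any of your functions are overloaded: hiding works per name, not per signature, so redeclaring a single overload in a derived struct hides all base overloads of that name. Below is a small sketch of the effect. The `twice` function and the struct names ending in `_Hiding` / `_Using` are made up for illustration, and a using-declaration is just one possible way to bring the remaining base overloads back:

```cpp
#include <type_traits>

struct SIMD_None {
    static constexpr int   twice(int x)   { return 2 * x; }
    static constexpr float twice(float x) { return 2.0f * x; }
};

// Redeclares only the float overload. The int overload from SIMD_None is now
// hidden, so twice(3) converts 3 to float and calls this function.
struct SIMD_Neon64_Hiding : SIMD_None {
    static constexpr float twice(float x) { return x + x; }
};

// Same override, but the using-declaration re-exposes the base overloads
// that are not redeclared here.
struct SIMD_Neon64_Using : SIMD_None {
    using SIMD_None::twice;
    static constexpr float twice(float x) { return x + x; }
};

// The return type reveals which overload was selected for an int argument.
static_assert(std::is_same<decltype(SIMD_Neon64_Hiding::twice(3)), float>::value,
              "int overload is hidden, the float one is used");
static_assert(std::is_same<decltype(SIMD_Neon64_Using::twice(3)), int>::value,
              "int overload is visible again");
```

If every function has exactly one signature, as in the examples above, this doesn't come up.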

    3. Using template specializations

    If you have an enum of all possible SIMD implementations, e.g.:

    enum class SIMD_Type {
        Min, // Dummy Value -> No Implementation found
        None, Neon64, Neon128, SSE3, AVX2,
        Max // Dummy Value -> Search downwards from here
    };

    You can use it to (recursively) walk through them until you find one that has been specialized, e.g:

    template<SIMD_Type type = SIMD_Type::Max>
    inline int add(int a, int b) {
        constexpr SIMD_Type nextType = static_cast<SIMD_Type>(static_cast<int>(type) - 1);
        return add<nextType>(a, b);
    }

    template<>
    inline int add<SIMD_Type::Neon64>(int a, int b) {
        std::cout << "NEON!" << std::endl;
        return a + b;
    }

    Here a call to add(1, 2) would first call add<SIMD_Type::Max>, which in turn would call add<SIMD_Type::AVX2>, add<SIMD_Type::SSE3>, add<SIMD_Type::Neon128>, and then add<SIMD_Type::Neon64>, which is a specialization, so the recursion stops there.

    If you want to make this a bit safer (to prevent long template instantiation chains), you can additionally add one specialization per function that stops the recursion if no other specialization is found, e.g.: godbolt example

    template<>
    inline int add<SIMD_Type::Min>(int a, int b) {
        // Always true as written, so this never fires; it only ends the recursion
        static_assert(SIMD_Type::Min == SIMD_Type::Min, "No implementation found!");
        return {};
    }
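Note that the condition in that static_assert is always true, so it never actually fires. If you'd rather get a hard compile error when nothing is implemented, one possible variant (a sketch, not what the godbolt example does) is to declare the Min specialization as deleted. `which_add`, `which_mul` and `next_lower` are hypothetical names; the functions return the level that was found instead of doing real work, so the result can be checked at compile time:

```cpp
enum class SIMD_Type { Min, None, Neon64, Neon128, SSE3, AVX2, Max };

// One level below `type` in the enum.
constexpr SIMD_Type next_lower(SIMD_Type type) {
    return static_cast<SIMD_Type>(static_cast<int>(type) - 1);
}

// Primary templates: nothing implemented at this level, try the next lower one.
template <SIMD_Type type = SIMD_Type::Max>
constexpr SIMD_Type which_add() { return which_add<next_lower(type)>(); }

template <SIMD_Type type = SIMD_Type::Max>
constexpr SIMD_Type which_mul() { return which_mul<next_lower(type)>(); }

// Reaching Min means nothing was found. These specializations are deleted,
// so a call chain that gets this far fails to compile.
template <> SIMD_Type which_add<SIMD_Type::Min>() = delete;
template <> SIMD_Type which_mul<SIMD_Type::Min>() = delete;

// which_add is implemented for None and Neon64; which_mul is not implemented at all.
template <> constexpr SIMD_Type which_add<SIMD_Type::None>()   { return SIMD_Type::None; }
template <> constexpr SIMD_Type which_add<SIMD_Type::Neon64>() { return SIMD_Type::Neon64; }

// Max -> AVX2 -> SSE3 -> Neon128 -> Neon64: the walk stops at the first specialization.
static_assert(which_add() == SIMD_Type::Neon64, "Neon64 is the highest implemented level");

// constexpr SIMD_Type m = which_mul();  // error: ends up calling the deleted which_mul<Min>
```

The unused `which_mul` template is harmless on its own; the error only shows up once somebody actually calls it.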

    In your case it could look like this:

    #include <iostream>

    /** functions.h *******************/
    enum class SIMD_Type {
        Min, // Dummy Value -> No Implementation found
        None, Neon64, Neon128, SSE3, AVX2,
        Max // Dummy Value -> Search downwards from here
    };
    #include "functions_stubs.h"
    #include "functions_unoptimized.h"
    #ifdef __ARM_NEON
    #include "functions_neon64.h"
    #endif
    #ifdef __SSE3__
    #include "functions_sse3.h"
    #endif
    // etc...

    /** functions_stubs.h *************/
    // Primary template: no implementation at this level, try the next lower one
    template<SIMD_Type type = SIMD_Type::Max>
    inline int add(int a, int b) {
        constexpr SIMD_Type nextType = static_cast<SIMD_Type>(static_cast<int>(type) - 1);
        return add<nextType>(a, b);
    }

    // Stops the recursion if no implementation was found
    template<>
    inline int add<SIMD_Type::Min>(int a, int b) {
        static_assert(SIMD_Type::Min == SIMD_Type::Min, "No implementation found!");
        return {};
    }

    /** functions_unoptimized.h *******/
    template<>
    inline int add<SIMD_Type::None>(int a, int b) {
        std::cout << "NONE" << std::endl;
        return a + b;
    }

    /** functions_neon64.h ************/
    template<>
    inline int add<SIMD_Type::Neon64>(int a, int b) {
        std::cout << "NEON!" << std::endl;
        return a + b;
    }

    /** functions_neon128.h ***********/
    template<>
    inline int add<SIMD_Type::Neon128>(int a, int b) {
        std::cout << "NEON128!" << std::endl;
        return a + b;
    }

    /** main.cpp **********************/
    #include "functions.h"
    int main() {
        add(1, 2);
    }

    This would output NEON128!.


    Pros:

    • No #ifdefs needed in the implementation header files
    • Callers don't need to be modified

    Cons:

    • Needs an extra dispatch function that recursively calls itself (until it hits a specialization)
    • The compiler might not optimize away all recursive calls (although most compilers probably will)
      Most compilers also offer a way to force inlining for certain functions (__attribute__((always_inline)) / __forceinline), which you could add to the base function templates to make sure all recursive calls actually get inlined.
    • Optionally needs another function to stop the recursive instantiation (not strictly required; without it, compilers give up with an error once their instantiation depth limit is reached)
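For the forced-inlining point, a minimal sketch of a portability macro could look like this. `SIMD_FORCE_INLINE` is a made-up name; `__forceinline` is the MSVC spelling and `__attribute__((always_inline))` the GCC/Clang one. The functions are marked constexpr only so that the result can be checked at compile time in this sketch; real intrinsic-based implementations would drop that:

```cpp
// Hypothetical helper macro that picks the compiler-specific spelling.
#if defined(_MSC_VER)
#  define SIMD_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define SIMD_FORCE_INLINE inline __attribute__((always_inline))
#else
#  define SIMD_FORCE_INLINE inline  // unknown compiler: plain inline as a hint
#endif

enum class SIMD_Type { Min, None, Neon64, Neon128, SSE3, AVX2, Max };

// Base template: forwards to the next lower level. Every level is a distinct
// function, so forcing inlining collapses the whole chain into the caller.
template <SIMD_Type type = SIMD_Type::Max>
constexpr SIMD_FORCE_INLINE int add(int a, int b) {
    return add<static_cast<SIMD_Type>(static_cast<int>(type) - 1)>(a, b);
}

// Non-SIMD fallback, which ends the chain.
template <>
constexpr SIMD_FORCE_INLINE int add<SIMD_Type::None>(int a, int b) {
    return a + b;
}

static_assert(add(1, 2) == 3, "Max walks down to the None specialization");
```

Whether this is needed at all is best checked in the generated assembly of an optimized build.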

    4. One file per function

    This is by far the easiest option - just put each function (or a collection of similar functions) into a single file and do the #ifdef's there.

    That way you have all the functions & their specializations for SIMD in a single file, which should also make editing a lot easier.


    /** functions.h *******************/
    #include "functions_add.h"
    #include "functions_sub.h"
    // etc...

    /** functions_add.h ***************/
    // inline avoids multiple-definition errors when the header is included
    // from more than one source file
    #ifdef __SSE3__
    // SSE3
    inline int add(int a, int b) {
      return a + b;
    }
    #elif defined(__ARM_NEON)
    // NEON
    inline int add(int a, int b) {
      return a + b;
    }
    #else
    // Fallback
    inline int add(int a, int b) {
      return a + b;
    }
    #endif

    /** functions_sub.h ***************/
    #ifdef __SSE3__
    // SSE3
    inline int sub(int a, int b) {
      return a - b;
    }
    #elif defined(__ARM_NEON_128)
    // NEON 128
    inline int sub(int a, int b) {
      return a - b;
    }
    #else
    // Fallback
    inline int sub(int a, int b) {
      return a - b;
    }
    #endif


    Pros:

    • The function and all of its specializations are in a single file, so figuring out which one gets called is a lot easier
    • Easy to implement and maintain as long as you don't stuff too many functions into a single file

    Cons:

    • Potentially lots of header files
    • #ifdefs need to be repeated in each header
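The last point can be softened a bit by doing the compiler-macro detection once, in a single config header, and testing only your own macros everywhere else. This is a sketch under assumptions: `simd_config.h` and the `MYLIB_HAVE_*` names are invented for illustration, and the function bodies are placeholders:

```cpp
// ---- simd_config.h (hypothetical) ----
// The only place that looks at compiler-defined macros.
#if defined(__AVX2__)
#  define MYLIB_HAVE_AVX2 1
#else
#  define MYLIB_HAVE_AVX2 0
#endif

#if defined(__SSE3__)
#  define MYLIB_HAVE_SSE3 1
#else
#  define MYLIB_HAVE_SSE3 0
#endif

#if defined(__ARM_NEON)
#  define MYLIB_HAVE_NEON 1
#else
#  define MYLIB_HAVE_NEON 0
#endif

// ---- functions_add.h ----
// Would start with: #include "simd_config.h"
#if MYLIB_HAVE_SSE3
inline int add(int a, int b) { return a + b; }  // SSE3 version would go here
#elif MYLIB_HAVE_NEON
inline int add(int a, int b) { return a + b; }  // NEON version would go here
#else
inline int add(int a, int b) { return a + b; }  // Fallback
#endif
```

The #if / #elif ladder still has to be repeated per function, but the conditions themselves live in one place. Because the macros are always defined as 0 or 1, they are tested with #if rather than #ifdef, and GCC and Clang can then flag a misspelled macro name via -Wundef.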