c++ templates optimization lambda c++17

Inlining lambda calls to template functions


I want a function template that takes in data and processes it with a lambda, in whatever way the template itself chooses. However, I want the lambda to be inlined so that the compiled assembly output doesn't contain a "call" instruction for it. Is this possible?

If it's not possible with lambdas, is there some other way to do that? Somehow using templates to pass a function as a template type or something?

I'm using C++17.

Below is an example of what I'm trying to achieve:

template <typename T>
static inline void Process(const T*                p_source1,
                           const T*                p_source2,
                           T*                      p_destination,
                           const int               count,
                           std::function<T (T, T)> processor)
{
    for (int i = 0; i < count; i++)
        p_destination[i] = processor(p_source1[i], p_source2[i]);
}


void Process_Add(const uint8_t* p_source1,
                 const uint8_t* p_source2,
                 uint8_t*       p_destination,
                 const int      count)
{
    // How to make something like this lambda inline?
    auto lambda = [] (uint8_t a, uint8_t b) { return a + b; };

    Process<uint8_t>(p_source1, p_source2, p_destination, count, lambda);
}

Solution

  • Yes, it's possible, but std::function makes it very unlikely: its type-erased call mechanism is complex enough that compilers usually can't inline through it, even in simple cases. See "Understanding the overhead from std::function and capturing synchronous lambdas".

    Here's the typical way of making inlining more likely:

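    #include <concepts>    // std::invocable, std::convertible_to (C++20)
    #include <type_traits> // std::invoke_result_t

    template <typename T, typename F>
      requires (std::invocable<F, const T&, const T&> // optional: C++20 constraint
            && std::convertible_to<std::invoke_result_t<F, const T&, const T&>, T>)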
    inline void Process(const T*  p_source1,
                        const T*  p_source2,
                        T*        p_destination,
                        const int count,
                        F         processor)
    {
        for (int i = 0; i < count; i++)
            p_destination[i] = processor(p_source1[i], p_source2[i]);
    }
    

    Each lambda expression has a unique closure type, so processor(...) invokes a call operator which is known at compile time. This makes inlining quite likely, as long as the lambda expression is relatively short.
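As a rough sketch of the call site under C++17, assuming the unconstrained form of the template (the `requires` clause dropped), the lambda is passed directly and `F` is deduced as its closure type. The `static_cast` inside the lambda is an addition for illustration, not part of the question's code; it makes the wrap-around of the promoted `int` sum explicit.

```cpp
#include <cstdint>

// Unconstrained C++17 variant: F is deduced as the lambda's closure type,
// so the call operator invoked in the loop is known at compile time.
template <typename T, typename F>
inline void Process(const T*  p_source1,
                    const T*  p_source2,
                    T*        p_destination,
                    const int count,
                    F         processor)
{
    for (int i = 0; i < count; i++)
        p_destination[i] = processor(p_source1[i], p_source2[i]);
}

void Process_Add(const std::uint8_t* p_source1,
                 const std::uint8_t* p_source2,
                 std::uint8_t*       p_destination,
                 const int           count)
{
    // a + b is promoted to int; the cast truncates it back to 8 bits.
    auto add = [](std::uint8_t a, std::uint8_t b) {
        return static_cast<std::uint8_t>(a + b);
    };

    // No explicit template arguments: both T and F are deduced.
    Process(p_source1, p_source2, p_destination, count, add);
}
```

Whether the call is actually inlined remains the optimizer's decision, so it is worth checking the generated assembly of an optimized build for your compiler.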

    Further notes

    You could imitate the C++20 constraint with std::enable_if_t, or you could just leave the function unconstrained.
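One possible way to imitate the constraint in C++17 is sketched below. It uses `std::is_invocable_r_v` from `<type_traits>`, which checks both that `F` is callable with two `const T&` arguments and that the result converts to `T`. The name `ProcessChecked` is hypothetical and only chosen to keep this variant apart from `Process`.

```cpp
#include <cstdint>
#include <type_traits>

// C++17 sketch: the defaulted template parameter removes this overload
// from consideration when F is not callable as T(const T&, const T&).
// ProcessChecked is a hypothetical name used for illustration only.
template <typename T, typename F,
          typename = std::enable_if_t<
              std::is_invocable_r_v<T, F, const T&, const T&>>>
inline void ProcessChecked(const T*  p_source1,
                           const T*  p_source2,
                           T*        p_destination,
                           const int count,
                           F         processor)
{
    for (int i = 0; i < count; i++)
        p_destination[i] = processor(p_source1[i], p_source2[i]);
}
```

Passing something that cannot be called with two `T` values then fails at the call site with a "no matching function" error instead of an error from inside the loop body.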

    Using static and inline in combination is basically pointless. static communicates internal linkage for functions, and that's likely not your intent, assuming this template is used in more than one cpp file. See Should one never use static inline function?