Tags: c++, performance, c++20, c++-coroutine

Performance of simple C++20 coroutines looks bad. Is this unavoidable? Is this the cost of "frame switching"?


I am working with C++20 coroutines, specifically simple generators, but I have observed similar results with coroutines replacing state machines based on boost::msm.

Actually, my goal is to provide a recommendation to my team about when to use C++20 coroutines after our move from C++17 to C++20. These examples exist only to support that recommendation; they are not real code to fix.

I have benchmarks on quick-bench.com showing that, in the case of simple generators, such coroutines can be 4x (gcc13, -O3) or 3.5x (clang17, -O3) slower than functionally identical code based on iterators.

The full code is available under this link; the following is just example code creating the Cartesian product of two ranges:

  • Reference code

    void cartesian(auto const& xx, auto const& yy, auto body)
    {
        for (auto x : xx) 
           for (auto y : yy)
               body(std::make_pair(x, y));
    } 
    // usage: cartesian(container1, container2, [&](auto const& x_y) { .... });
    
  • Its "coroutine" equivalent

    template <typename T> struct generator { /*something similar to C++23 std::generator */ };
    
    template <typename XX, typename YY>
    generator<std::pair<typename XX::value_type, typename YY::value_type>>
    cartesian(XX const& xx, YY const& yy)
    {
        for (auto x : xx)
            for (auto y : yy)
                co_yield std::make_pair(x, y);
    }
    
    // usage: 
    // for (const auto& x_y : cartesian(container1, container2)) 
    // { .... }
    
  • Functionally identical code based on iterators

    template <typename R1, typename R2>
    class cartesian_combine
    {
    public:
        explicit cartesian_combine(const R1& r1, const R2& r2) 
            : xx(r1), yy(r2)
        {}
    
        struct iterator_sentinel {};
    
        template <typename It1, typename S1,
                  typename It2, typename S2>
        struct const_iterator
        {
            const_iterator(It1 xx_begin, S1 xx_end,
                           It2 yy_begin, S2 yy_end)
                : xx_begin(xx_begin), xx_it(xx_begin), xx_end(xx_end),
                  yy_begin(yy_begin), yy_it(yy_begin), yy_end(yy_end)
            {
                if (yy_it == yy_end) { xx_it = xx_end; }
            }
            const It1 xx_begin;
            It1 xx_it;
            const S1 xx_end;
            const It2 yy_begin;
            It2 yy_it;
            const S2 yy_end;
    
            bool operator==(iterator_sentinel) const {
                return xx_it == xx_end;
            }
            auto operator*() const {
                return std::make_pair(*xx_it, *yy_it);
            }
            const_iterator& operator++() {
                if (++yy_it == yy_end) 
                { 
                    ++xx_it; 
                    yy_it = yy_begin; 
                }
                return *this;
            }
        };
    
        auto begin() const { return const_iterator<
            decltype(xx.begin()), decltype(xx.end()),
            decltype(yy.begin()), decltype(yy.end())
          >{xx.begin(), xx.end(), 
            yy.begin(), yy.end()}; 
        }
        iterator_sentinel end() const { return {}; }
    
    private:
        const R1& xx;
        const R2& yy;
    };
    
    auto 
    cartesian(const auto& xx, const auto& yy)
    {
        return cartesian_combine(xx, yy);
    }
    // usage - identical as in "coroutine" case
    

It's evident that the "coroutine" code is much simpler than this bug-prone code juggling a bunch of iterators. However, the performance cost is something I am genuinely worried about. My guess is that this is the cost of switching between the frames of the calling function and the coroutine, but why is that cost so high?

[Result chart from quick-bench]

In my example, I combine two arrays of 10 elements each, resulting in 100 switches between these frames (or contexts; I am not sure of the proper terminology here). This is probably not the cost of the dynamic memory allocation for the coroutine frame, because that allocation happens only once; I even tried returning a preallocated buffer from the generator's promise_type operator new(), but it did not help.
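
Such an override could look roughly like the sketch below (this is not the exact code behind the link; the arena name and size are invented here, and it assumes a single live generator per thread whose frame fits into the buffer):

#include <array>
#include <cstddef>
#include <new>

// Illustrative only: a base class the generator's promise_type could inherit from,
// so the coroutine frame reuses one fixed thread-local buffer instead of calling
// the global allocator for every generator.
struct frame_arena_promise_base
{
    alignas(std::max_align_t) static inline thread_local std::array<std::byte, 4096> arena;

    static void* operator new(std::size_t size)
    {
        if (size <= arena.size())
            return arena.data();      // hand out the preallocated buffer
        return ::operator new(size);  // fall back to the heap for larger frames
    }
    static void operator delete(void* ptr)
    {
        if (ptr != arena.data())
            ::operator delete(ptr);   // only release heap-backed frames
    }
};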

Just in case something is wrong with my generator, here is the code:

#include <coroutine>
#include <cstdlib>
#include <utility>

template <typename T>
class generator
{
public:
    struct promise_type;
    using handle_type = std::coroutine_handle<promise_type>;
    struct promise_type
    {
        const T* value{};
        std::suspend_always yield_value(const T& v) {
            value = &v;
            return {};
        }

        std::suspend_never initial_suspend() { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() { std::exit(-1); }

        generator get_return_object() { 
            return generator(handle_type::from_promise(*this)); 
        }
    };

    explicit generator(handle_type h) : handle(h) {}
    ~generator() 
    {
        if (handle)
            handle.destroy();
    }
    generator(generator&& other) 
        : handle(std::exchange(other.handle, {}))
    {}

    generator& operator=(generator&& other)
    {
        if (this != &other)
        {
            if (handle) handle.destroy();
            handle = std::exchange(other.handle, {});
        }
        return *this;
    } 

    struct iterator_sentinel {};
    struct const_iterator
    {
        handle_type handle;
        bool operator==(iterator_sentinel) const {
            return handle.done();
        }
        const T& operator*() const {
            return *handle.promise().value;
        }
        const_iterator& operator++() {
            handle.resume();
            return *this;
        }
    };

    const_iterator begin() const { return {handle}; }
    iterator_sentinel end() const { return {}; }

private:
    handle_type handle;
};
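
To illustrate the intended usage of this class, a trivial counting generator would look like this (illustrative only, not part of the benchmarked code):

// Illustrative only: a simple counting generator built on the class above.
generator<int> iota(int n)
{
    for (int i = 0; i < n; ++i)
        co_yield i;
}
// usage: for (int v : iota(3)) { ... }   // visits 0, 1, 2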

Solution

  • The range cross-product and iterator code does nothing besides manipulate indexes and perform a single memory load. If it weren't for the hard-coded "don't optimize", it would optimize down to doing nothing.

    The coroutine code does not optimize down to doing nothing, because it sets up the ability to do arbitrary actions, and it does type erasure.

    If your operations are empty, coroutines are a bad idea.

    If, however, your operations are as cheap as a single memory allocation, the difference almost completely disappears.

    static void coro_combine_test(benchmark::State& state) {
        std::vector<std::unique_ptr<int[]>> buff;
        for (auto _ : state) {
          for (auto x : coro_combine(xx, yy)) {
            buff.emplace_back( new int[x.first*x.second] );
            benchmark::DoNotOptimize(x);
          }
          buff.clear();
        }
    }
    BENCHMARK(coro_combine_test);
    

    Here I add a memory allocation (of 4 to 400 bytes) per visited element and free them all at the end of each iteration. This represents a bit of non-trivial computation being done.

    When I do that, I get 13,000 for the iterator and range versions, and 15,000 for the coroutine version.

    Coroutines, as implemented in C++, are not suited to replace per-pixel operations, where the amount of iteration far dominates the amount of work done in the iteration. Your benchmark is a good way to show this.

    If you did a similar test using a type-erased iterator of any kind you'd get similar overhead problems; this isn't specific to coroutines, but to type erasure.
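
    As an illustration (not taken from the benchmark above), the same traversal routed through a type-erased std::function callback pays a comparable per-element price, because each element costs an indirect, non-inlinable call:

    #include <functional>
    #include <utility>
    #include <vector>

    // Illustrative only: std::function erases the callback's type, so the body
    // cannot be inlined into the loops and every element pays an indirect call,
    // much like resuming a type-erased coroutine frame.
    void cartesian_erased(const std::vector<int>& xx, const std::vector<int>& yy,
                          const std::function<void(std::pair<int, int>)>& body)
    {
        for (int x : xx)
            for (int y : yy)
                body(std::make_pair(x, y));   // indirect call per element
    }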

    A way I usually handle this is to ensure that the chunks of data I iterate over in type-erased code are non-trivial. For example, if I'm iterating over pixels, my runtime type erasure is at the level of scanlines or blocks of scanlines.

    Any higher-resolution type erasure I make compile-time, so I can write per-pixel operations, have my compile-time templates rewrite them into block or scanline operations, and then feed those block or scanline operations into the type-erased (iteration, coroutine, or whatever) system.

    #include <cstddef>
    #include <functional>
    #include <span>

    // MyImage, MyPixel and MyLocation stand in for the user's own image types.
    // Only the scanline-level callback is type-erased (std::function), so the
    // indirect-call cost is paid once per scanline instead of once per pixel.
    void ForeachScanline( MyImage* pImage, std::function<void(std::size_t y, std::span<MyPixel>)>);
    template<class F>
    void ForeachPixel( MyImage* pImage, F f ) {
      ForeachScanline( pImage, [&f](std::size_t y, std::span<MyPixel> pixels) {
        // The per-pixel functor F is a template parameter, so this loop and f inline together.
        for (std::size_t x = 0; x < pixels.size(); ++x) {
          f(MyLocation{x,y}, pixels[x]);
        }
      });
    }
    

    The runtime type-erased scanline implementation is used by a compile-time type-erased pixel implementation. This means that the runtime type-erasure overhead occurs far less often; the scanline loop can be inlined and mixed with the per-pixel operation.
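
    For instance, with the same hypothetical MyImage/MyPixel/MyLocation types, the caller still writes per-pixel code; because the lambda is a template parameter of ForeachPixel, it is inlined into the scanline loop and only one type-erased call is made per scanline:

    // Illustrative call site; MyPixel::value is an assumed member for this sketch.
    void brighten(MyImage* pImage) {
      ForeachPixel(pImage, [](MyLocation /*loc*/, MyPixel& p) {
        p.value += 10;   // per-pixel work, inlined into the scanline loop
      });
    }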

    The decision to make coroutines require type erasure led to this result. Compiler writers had issues with doing it without type erasure, because the size of a coroutine frame is difficult to determine early enough in compilation. A type-erased coroutine can have its size determined before its body is fully understood.