Tags: c++, boost-asio, c++-coroutine

Boost ASIO: What executor is associated with the default completion tokens? Should they be bound locally for performance?


Consider the following example (Godbolt):

#include <boost/asio.hpp>
#include <print>
#include <thread>

namespace asio = boost::asio;

inline void out(auto const& msg) {
  static const std::hash<std::thread::id> h {};
  std::println("T{} {}", h(std::this_thread::get_id()) % 100, msg);
}

asio::awaitable<void> f(auto strand) {
  auto executor = co_await asio::this_coro::executor;

  auto to_strd = bind_executor(strand, asio::deferred);
  auto to_coro = bind_executor(executor, asio::deferred);

  out("I'm on main");
  co_await dispatch(strand, asio::deferred);
  out("Still on main, asio::deferred has an associated executor");
  co_await dispatch(to_strd);
  out("Now I'm on the strand");
  co_await dispatch(asio::deferred);
  out("Back on main, using asio::deferred's associated executor");
  co_await dispatch(to_strd);
  out("Strand again");
  co_await dispatch(to_coro);
  out("Was this faster than dispatch(asio::deferred)?");
}

int main() {
  asio::io_context io;
  asio::thread_pool tp(1);

  co_spawn(io, f(make_strand(tp)), asio::detached);

  out("<--- this is main");
  io.run();
  tp.join();
}

When dispatch()'ing or post()'ing with Asio-provided completion tokens such as deferred and use_awaitable, any explicitly provided executor is ignored. Instead, this_coro::executor is used (I think).

Two questions:

  • How does dispatch() end up with this_coro::executor? What's the mechanism? Because deferred isn't an executor_binder; it's an empty type. I understand it gets the executor from the coroutine frame, but how do we get there?
  • Even though this works, for latency performance is it faster to bind_executor() at the top if we're going to be doing this multiple times? Or is it irrelevant?

Solution

  • The default associated executor is the coroutine's. The one you extracted with

    auto executor = co_await asio::this_coro::executor;
    

    You can actually make this more visible with a slightly enhanced demo

    Live On Coliru

    #include <boost/asio.hpp>
    #include <fmt/core.h>
    #include <thread>
    
    namespace asio = boost::asio;
    using namespace std::chrono_literals;
    
    static std::atomic_int tid_gen;
    thread_local int const tid = tid_gen++;
    
    inline void out(auto const& msg) { fmt::print("T{:x} {}\n", tid, msg); }
    
    asio::awaitable<void> f(auto strand) {
        auto original = co_await asio::this_coro::executor;
    
        auto to_strand = bind_executor(strand, asio::deferred);
        auto to_orig   = bind_executor(original, asio::deferred);
    
        out("------\n");
        out("I'm on original");
        co_await dispatch(strand, asio::deferred);
        out("Still on original, asio::deferred has an associated executor");
        co_await dispatch(to_strand);
        out("Now I'm on the strand");
        co_await dispatch(asio::deferred);
        out("Back on original, using asio::deferred's associated executor");
        co_await dispatch(to_strand);
        out("Strand again");
        co_await dispatch(to_orig);
        out("Was this faster than dispatch(asio::deferred)?");
    }
    
    int main() {
        out("<--- this is main");
        asio::thread_pool tp1(1);
        post(tp1, [] { out("I'm on tp1"); });
    
        asio::thread_pool tp2(1);
        post(tp2, [] { out("I'm on tp2"); });
    
        std::this_thread::sleep_for(10ms); // not very important but slightly simplifies output
    
        {
            asio::io_context io;
            co_spawn(io, f(make_strand(tp1)), asio::detached);
            io.run();
        }
        {
            asio::io_context io;
            co_spawn(io, f(make_strand(tp2)), asio::detached);
            io.run();
        }
        {                                                     //
            co_spawn(                                         //
                tp1,                                          //
                f(make_strand(tp2)),                          //
                consign(asio::detached, make_work_guard(tp1)) //
            );
        }
        tp1.join();
        tp2.join();
    }
    

    Printing

    T0 <--- this is main
    T1 I'm on tp1
    T2 I'm on tp2
    T0 ------
    T0 I'm on original
    T0 Still on original, asio::deferred has an associated executor
    T1 Now I'm on the strand
    T0 Back on original, using asio::deferred's associated executor
    T1 Strand again
    T0 Was this faster than dispatch(asio::deferred)?
    T0 ------
    T0 I'm on original
    T0 Still on original, asio::deferred has an associated executor
    T2 Now I'm on the strand
    T0 Back on original, using asio::deferred's associated executor
    T2 Strand again
    T0 Was this faster than dispatch(asio::deferred)?
    T1 ------
    T1 I'm on original
    T1 Still on original, asio::deferred has an associated executor
    T2 Now I'm on the strand
    T1 Back on original, using asio::deferred's associated executor
    T2 Strand again
    T1 Was this faster than dispatch(asio::deferred)?
    

    The extra questions:

    Q. How does dispatch() end up with this_coro::executor? What's the mechanism? Because deferred isn't an executor_binder; it's an empty type. I understand it gets the executor from the coroutine frame, but how do we get there?

    It's handled by the promise type's await_transform method. You should find it in awaitable_frame_base in Asio's source.
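    The mechanism can be illustrated without Asio at all. Below is a self-contained sketch (all names here — task, deferred_t, the string standing in for an executor — are made up for illustration, not Asio's actual types) showing how a promise type's await_transform can intercept an empty token and inject state stored in the coroutine frame, which is essentially what awaitable_frame_base does with the executor:

    ```cpp
    #include <coroutine>
    #include <iostream>
    #include <string>

    // Empty token type, analogous to asio::deferred: carries no executor itself.
    struct deferred_t {};
    inline constexpr deferred_t deferred{};

    struct task {
        struct promise_type {
            // State living in the coroutine frame, analogous to this_coro::executor.
            std::string executor = "coro-executor";

            task get_return_object() { return {}; }
            std::suspend_never initial_suspend() { return {}; }
            std::suspend_never final_suspend() noexcept { return {}; }
            void return_void() {}
            void unhandled_exception() {}

            // The mechanism: every `co_await expr` in the coroutine body is
            // rewritten to `co_await promise.await_transform(expr)`, so the
            // promise gets a chance to combine the token with frame state.
            auto await_transform(deferred_t) {
                struct awaiter {
                    std::string ex;
                    bool await_ready() const { return false; }
                    bool await_suspend(std::coroutine_handle<>) {
                        // Real Asio would dispatch the resumption through `ex`;
                        // here we just show that the executor was recovered.
                        std::cout << "dispatching via " << ex << "\n";
                        return false; // resume immediately
                    }
                    void await_resume() {}
                };
                return awaiter{executor};
            }
        };
    };

    task f() {
        co_await deferred; // no executor visible here, yet one is used
    }

    int main() { f(); }
    ```

    So dispatch(asio::deferred) never needs to know the executor itself; the coroutine's promise supplies it when the await expression is transformed.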

    Q. Even though this works, for latency performance is it faster to bind_executor() at the top if we're going to be doing this multiple times? Or is it irrelevant?

    You can always measure. I'd expect carrying and decoding the explicit executor_binder to hurt performance slightly, but it's possible that compilers optimize it all away after inlining.
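    If you do want to measure, a sketch along these lines would compare the two (the iteration count and structure are arbitrary; this is a rough micro-benchmark, not a rigorous one — wall-clock numbers will vary with load and optimization level):

    ```cpp
    #include <boost/asio.hpp>
    #include <chrono>
    #include <print>

    namespace asio = boost::asio;

    asio::awaitable<void> bench() {
        auto ex    = co_await asio::this_coro::executor;
        auto bound = bind_executor(ex, asio::deferred);

        constexpr int N = 100'000;
        using clk = std::chrono::steady_clock;
        auto us = [](auto d) {
            return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
        };

        auto t0 = clk::now();
        for (int i = 0; i < N; ++i)
            co_await dispatch(asio::deferred); // default: the coroutine's executor
        auto t1 = clk::now();
        for (int i = 0; i < N; ++i)
            co_await dispatch(bound); // explicitly bound executor
        auto t2 = clk::now();

        std::println("default token: {} us", us(t1 - t0));
        std::println("bound token:   {} us", us(t2 - t1));
    }

    int main() {
        asio::io_context io;
        co_spawn(io, bench(), asio::detached);
        io.run();
    }
    ```

    Either way, compile with optimizations enabled before drawing any conclusions.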