Tags: c++, tuples, boost-hana

Compile-time impact of using recursive template implementation for tuples


Several sources, e.g., Why is it not good to use recursive inheritance for std::tuple implementations?, point out the compile-time cost of using a recursive template implementation for tuples. I'm considering writing a system that will use hundreds of tuples, each of size 1-100 (median around 5 to 10 at a guess), and I'm wondering whether I should care that the standard libraries I will be using have recursive implementations. Can anyone quantify this -- e.g., is the impact on build time of the order of 25%, or memory usage of the order of 50% more? I've been bitten badly before by other template-instantiation issues, hence my concern.

My main interest is creating tuples, and then iterating over them using std::get, likely with code along the following lines:

// for_each_tuple: from https://www.cppstories.com/2022/tuple-iteration-apply/
template <typename TupleT, typename Fn>
void for_each_tuple(TupleT&& tp, Fn&& fn)
{
    std::apply([&fn]<typename... T>(T&&... args) { (fn(std::forward<T>(args)), ...); },
               std::forward<TupleT>(tp));
}

For libc++, std::get is clearly non-recursive, hence O(1) for the compiler, but the code for msvc and gcc isn't as clear; I think msvc is recursive, O(n), but I can't figure out the gcc code. Are they both recursive?

msvc's std::get:

template <size_t _Index, class... _Types>
constexpr tuple_element_t<_Index, tuple<_Types...>>& get(tuple<_Types...>& _Tuple) noexcept {
    using _Ttype = typename tuple_element<_Index, tuple<_Types...>>::_Ttype;
    return static_cast<_Ttype&>(_Tuple)._Myfirst._Val;
}

then (in utility)

template <size_t _Index, class _This, class... _Rest>
struct tuple_element<_Index, tuple<_This, _Rest...>>
    : tuple_element<_Index - 1, tuple<_Rest...>> {}; // recursive tuple_element definition

gcc libstdc++

  template<size_t __i, typename... _Elements>
    constexpr __tuple_element_t<__i, tuple<_Elements...>>&
    get(tuple<_Elements...>& __t) noexcept
    { return std::__get_helper<__i>(__t); }

  template<size_t __i, typename _Head, typename... _Tail>
    constexpr _Head&
    __get_helper(_Tuple_impl<__i, _Head, _Tail...>& __t) noexcept
    { return _Tuple_impl<__i, _Head, _Tail...>::_M_head(__t); }

clang libc++

template <size_t _Ip, class ..._Tp>
typename tuple_element<_Ip, tuple<_Tp...> >::type&
get(tuple<_Tp...>& __t) _NOEXCEPT
{
    typedef typename tuple_element<_Ip, tuple<_Tp...> >::type type;
    return static_cast<__tuple_leaf<_Ip, type>&>(__t.__base_).get();
}

For construction, the two standard libraries I am using, msvc and gcc, both use a recursive approach.

msvc

template <class _This, class... _Rest>
class tuple<_This, _Rest...> : private tuple<_Rest...> { // recursive tuple definition
public:
    using _This_type = _This;
    using _Mybase    = tuple<_Rest...>;

gcc

  /**
   * Recursive tuple implementation. Here we store the @c Head element
   * and derive from a @c Tuple_impl containing the remaining elements
   * (which contains the @c Tail).
   */
  template<size_t _Idx, typename _Head, typename... _Tail>
    struct _Tuple_impl<_Idx, _Head, _Tail...>
    : public _Tuple_impl<_Idx + 1, _Tail...>,
      private _Head_base<_Idx, _Head>
    {
      template<size_t, typename...> friend struct _Tuple_impl;

cf. clang

template <class ..._Tp>
class tuple
{
    typedef __tuple_impl<typename __make_tuple_indices<sizeof...(_Tp)>::type, _Tp...> _BaseT;
    _BaseT __base_;

In summary, if I've read the standard library code right, it seems at least one of the msvc and gcc std::get implementations is recursive, hence O(n) for the compiler and O(n^2) for the for_each code I lifted from C++ Stories. Is it possible to get the iteration down to O(n) for the compiler? And even then, is there still a significant time/space cost for the compiler due to the recursive tuple construction on msvc and gcc?

(I'm using msvc 2022 17.6, gcc 13.2, clang 16, and have deleted some boilerplate in the excerpts from the standard libraries, to aid readability.)


Solution

  • I ended up doing the work to measure performance for different tuple sizes and different implementations (msvc, libc++, libstdc++, boost::hana tuple) -- which is what I was trying to avoid by asking the question :).

    Main conclusions from the performance data I collected:

    • Measurements are consistent with O(n^2) compile time complexity for msvc and libstdc++.
    • Hana's tuple is indeed the fastest for larger tuples: the compile-time increase is close to linear in tuple size. However, on linux (gcc/clang), std::tuple is fine for smaller tuples.
    • msvc appears to consume heavy resources compiling std::tuple, whereas with hana::tuple its compile times were pretty much instantaneous: even a 3-element std::tuple took about 0.8s to compile, compared to a flat 0.05s for hana::tuple regardless of size.
    • Most worrying with msvc was memory consumption -- a tuple of size 196 saw ~7GB used, and I was unable to try anything larger as a result. I've seen large memory usage with previous msvc compilers and template-heavy code in a commercial setting -- it can be more important than time-to-compile, because it limits the number of parallel compiles (and it's certainly more troublesome, as builds fall over unpredictably).
    • The libc++ std::tuple does indeed scale very well. However, the compiler appears slower overall, so these gains only become apparent for large tuples (size over 100).
    • Strangely, clang scales much better than expected: with libstdc++, scaling appears better than quadratic, and with libc++, better than linear. I can't explain this, but I'd guess clang has some neat optimizations around template instantiation, or is exploiting patterns in my copy-and-paste test data.

    Hana's claim

    The reason for providing our own tuple is principally performance. Indeed, all the std::tuple implementations tested so far have a very bad compile-time performance

    holds up for very large tuples, and on msvc regardless of tuple size. I wonder how actively it's maintained given advances in C++ since C++14, and whether further optimizations might now be possible.

    Results below. Note the performance measurements for linux (gcc and clang) were taken on slower hardware. Memory usage is very approximate.

    [Chart: time/s to compile tuple construction / get with different compilers and tuple implementations; a larger image including the underlying data was attached.]

    Test code

    //#define STD_TUPLE
    #ifndef STD_TUPLE
    #include <boost/hana.hpp>
    // note, including only boost/hana/tuple.hpp and boost/hana/for_each.hpp leads to faster builds, especially on gcc
    #endif
    #include <iostream>
    #include <tuple>
    
    #ifdef STD_TUPLE
    template <typename TupleT, typename Fn>
    void for_each_tuple(TupleT&& tp, Fn&& fn)
    {
        std::apply(
            [&fn]<typename... T>(T&&... args) { (fn(std::forward<T>(args)), ...); },
            std::forward<TupleT>(tp));
    }
    #endif
    
    void print_tuple()
    {
    #ifdef STD_TUPLE
        std::tuple t{
    #else
        auto t = boost::hana::make_tuple(
    #endif
            // repeat this line 4x, 8x, or 16x
            0,0.0,"0", 1,1.0,"1", 2,2.0,"2", 3,3.0,"3", 4,4.0,"4", 5,5.0,"5", 6,6.0,"6", 7,7.0,"7"
    #ifdef STD_TUPLE
        };
        for_each_tuple(t, [](const auto& x) { std::cout << x; });
    #else
        );
        boost::hana::for_each(t, [&](const auto& x) { std::cout << x; });
    #endif
    }
    

    To bring in hana with CMake: FetchContent_Declare(hana SYSTEM GIT_REPOSITORY https://github.com/boostorg/hana.git GIT_TAG ${gitTag}), then FetchContent_MakeAvailable(hana), and target_link_libraries(mylib hana). For msvc I needed to delete the line constexpr auto is_literal_type = detail::hana_trait<std::is_literal_type>{};, as msvc has followed the C++20 standard and removed std::is_literal_type (cppreference).
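    Spelled out as a CMakeLists fragment (the target name mylib and ${gitTag} are the placeholders from the note above; the SYSTEM keyword on FetchContent_Declare needs CMake 3.25 or later):

```cmake
include(FetchContent)

# SYSTEM marks hana's headers as system includes (CMake >= 3.25).
FetchContent_Declare(hana SYSTEM
    GIT_REPOSITORY https://github.com/boostorg/hana.git
    GIT_TAG ${gitTag})  # ${gitTag} left as the placeholder used above
FetchContent_MakeAvailable(hana)

target_link_libraries(mylib hana)
```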