Tags: c++, boost, boost-pool

boost::pool_allocator significantly slower than std::allocator


I am learning about memory pools and trying to use boost::pool_allocator in my project. Following the documentation, I wrote a small timing test:

#include <boost/pool/pool_alloc.hpp>
#include <chrono>
#include <iostream>
#include <vector>

template <typename Alloc>
void test()
{
    using namespace std::chrono;
    auto t0 = high_resolution_clock::now();
    for (int i = 0; i < 1000; ++i) {
        std::vector<int, Alloc> vec;
        for (int j = 0; j < 10000; ++j)
            vec.push_back(i + j);
    }
    auto t1 = high_resolution_clock::now();
    auto time_ms = duration<double>(t1 - t0).count() * 1e3;
    std::cout << "time cost: " << time_ms << " ms" << std::endl;
}

int main()
{
    test<std::allocator<int>>();
    test<boost::pool_allocator<int>>();
}

And the result is:

time cost: 3.97602 ms
time cost: 91.3943 ms

The Boost doc says:

Pools are generally used when there is a lot of allocation and deallocation of small objects.

So I expected boost::pool_allocator to take less time than std::allocator in the code above, but the test result shows it is much slower.

Am I using boost::pool_allocator wrong? In what situations can I obtain a speedup from a memory pool (or specifically from Boost pool/pool_allocator)?


Solution

  • For reference:

    template <typename T,
        typename UserAllocator = default_user_allocator_new_delete,
        typename Mutex = details::pool::default_mutex,
        unsigned NextSize = 32,
        unsigned MaxSize = 0>
    class pool_allocator;
    

    I suspected the locking might be to blame; there also seemed to be room for better size hinting.
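
    Both of those are template parameters: passing null_mutex for Mutex skips the locking entirely, and NextSize controls how many chunks the pool requests each time it grows. As a minimal sketch (the values here are purely illustrative), a tuned allocator type could look like:

    #include <boost/pool/pool_alloc.hpp>

    // Single-threaded pool allocator that grows in bigger steps:
    // null_mutex disables locking, 1024 is an arbitrary NextSize hint.
    using TunedIntAlloc = boost::pool_allocator<
        int,
        boost::default_user_allocator_new_delete,
        boost::details::pool::null_mutex,
        1024 /* NextSize */>;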

    Let's test! Live On Compiler Explorer

    #include <boost/core/demangle.hpp>
    #include <boost/pool/pool_alloc.hpp>
    #include <chrono>
    #include <iomanip>
    #include <iostream>
    #include <vector>
    
    using namespace std::chrono_literals;
    auto static now = std::chrono::high_resolution_clock::now;
    
    template <typename Alloc> void test(int run, Alloc alloc = {}) {
        auto load = [=](bool RESERVE, unsigned ITERATIONS = 1'000, unsigned SIZE = 10'000) {
            for (unsigned i = 0; i < ITERATIONS; ++i) {
                std::vector<int, Alloc> vec(alloc);
                if (RESERVE)
                    vec.reserve(SIZE);
                for (unsigned j = 0; j < SIZE; ++j)
                    vec.push_back(i + j);
            }
        };
    
        auto lap_time = [t0 = now()]() mutable {
            return now() - std::exchange(t0, now());
        };
    
        load(false); auto without_reserve = lap_time() / 1.0ms;
        load(true);  auto with_reserve    = lap_time() / 1.0ms;
    
        std::cout << "run " << run                                             //
                  << " naive:    " << std::setw(7) << without_reserve << "ms"  //
                  << " reserved: " << std::setw(7) << with_reserve    << "ms"  //
                  << "(" << boost::core::demangle(typeid(Alloc).name()) << ")" //
                  << std::endl;
    }
    
    void run_tests(int run) {
        test<std::allocator<int>>(run);
    
        using NullMx    = boost::details::pool::null_mutex;
        using Mx        = boost::details::pool::default_mutex;
        using Malloc    = boost::default_user_allocator_malloc_free;
        using NewDelete = boost::default_user_allocator_new_delete;
    
        // 
        // no hints
        //
        test<boost::pool_allocator<int, Malloc,    NullMx>>(run);
        test<boost::pool_allocator<int, NewDelete, NullMx>>(run);
        test<boost::pool_allocator<int, Malloc,    Mx>>(run);
        test<boost::pool_allocator<int, NewDelete, Mx>>(run);
    
        //
        // hinted
        //
        test<boost::pool_allocator<int, Malloc,    NullMx, 1'000, 0>>(run);
        test<boost::pool_allocator<int, NewDelete, NullMx, 1'000, 0>>(run);
        test<boost::pool_allocator<int, Malloc,    Mx,     1'000, 0>>(run);
        test<boost::pool_allocator<int, NewDelete, Mx,     1'000, 0>>(run);
    }
    
    int main()
    {
        std::cout << std::fixed << std::setprecision(3);
    
        for (int run : {1,2,3}) {
            auto t0 = now();
            run_tests(run);
            std::cout << " -- Done (" << (now() - t0) / 1.ms << "ms)" << std::endl;
        }
    }
    

    Compiler Explorer shows some rather inconsistent spikes; my own machine doesn't:

    run 1 naive:      8.025ms reserved:   5.412ms(std::allocator<int>)
    run 1 naive:     92.212ms reserved:  31.166ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
    run 1 naive:     93.466ms reserved:  29.901ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
    run 1 naive:     92.488ms reserved:  29.883ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
    run 1 naive:     92.450ms reserved:  29.824ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
    run 1 naive:     82.879ms reserved:  27.478ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
    run 1 naive:     82.775ms reserved:  28.187ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
    run 1 naive:     83.189ms reserved:  27.404ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
    run 1 naive:     83.159ms reserved:  27.468ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
     -- Done (947.595ms)
    run 2 naive:      8.007ms reserved:   5.543ms(std::allocator<int>)
    run 2 naive:     92.225ms reserved:  29.882ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
    run 2 naive:     92.311ms reserved:  29.805ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
    run 2 naive:     92.601ms reserved:  29.873ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
    run 2 naive:     92.421ms reserved:  30.028ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
    run 2 naive:     83.028ms reserved:  27.493ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
    run 2 naive:     82.822ms reserved:  27.427ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
    run 2 naive:     83.230ms reserved:  27.493ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
    run 2 naive:     83.104ms reserved:  27.466ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
     -- Done (944.958ms)
    run 3 naive:      8.068ms reserved:   5.422ms(std::allocator<int>)
    run 3 naive:     92.282ms reserved:  29.880ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
    run 3 naive:     92.064ms reserved:  29.960ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
    run 3 naive:     92.339ms reserved:  29.928ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
    run 3 naive:     92.977ms reserved:  29.890ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
    run 3 naive:     82.906ms reserved:  27.388ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
    run 3 naive:     82.784ms reserved:  27.585ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
    run 3 naive:     83.157ms reserved:  28.233ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
    run 3 naive:     83.098ms reserved:  27.466ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
     -- Done (945.629ms)
    

    Still, clearly, it is always slower. Let's profile.

    Profiler

    Comparing just the standard allocator against boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>:

    • standard allocator causes 48k calls to new/delete, resulting in as many underlying calls to malloc/free

      [profiler screenshot: new/delete and malloc/free call counts with std::allocator]

    • pool allocator shows vastly reduced numbers:

      [profiler screenshot: far fewer allocation calls with the pool allocator]

      and for malloc/free:

      [profiler screenshot: malloc/free call counts with the pool allocator]

    So far so good! What is taking so much time then?

    [profiler screenshot: time concentrated in unordered_malloc]

    Here unordered_malloc inlines into a large number of lines from various Boost Pool headers. The top offenders are inlined from boost/pool/simple_segregated_storage.hpp (the second column is the percentage of cost relative to the parent):

    [profiler screenshot: hottest lines inlined from simple_segregated_storage.hpp]

    Those lines are in try_malloc_n

    template <typename SizeType>
    void * simple_segregated_storage<SizeType>::try_malloc_n(
        void * & start, size_type n, const size_type partition_size)
    {
      void * iter = nextof(start);
      while (--n != 0)
      {
        void * next = nextof(iter);
        if (next != static_cast<char *>(iter) + partition_size)
        {
          // next == 0 (end-of-list) or non-contiguous chunk found
          start = iter;
          return 0;
        }
        iter = next;
      }
      return iter;
    }
    

    Which self-describes as:

    The function attempts to find n contiguous chunks of size partition_size in the free list, starting at start. If it succeeds, it returns the last chunk in that contiguous sequence, so that the sequence is known by [start, {retval}]. If it fails, it does so either because it's at the end of the free list or hits a non-contiguous chunk. In either case, it will return 0, and set start to the last considered chunk. You are at the end of the free list if nextof(start) == 0. Otherwise, start points to the last chunk in the contiguous sequence, and nextof(start) points to the first chunk in the next contiguous sequence (assuming an ordered free list).

    It does appear, then, that chasing free blocks on the segregated heap is too costly in this type of scenario after all. A little napkin calculation shows that try_malloc_n accounts for 99.75% of the higher-level unordered_malloc call we saw earlier.

    [profiler screenshot: try_malloc_n dominating the cost of unordered_malloc]
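
    To get a feel for how often that scan runs in this benchmark, consider the request pattern a single vector generates. Assuming the common capacity-doubling growth policy (growth factors differ between standard library implementations), growing one vector to 10,000 ints via push_back issues about 15 array allocations, the later ones asking for thousands of contiguous chunks, and the benchmark repeats this 1,000 times. A small sketch of that pattern:

    #include <cstddef>
    #include <iostream>

    // Models the array allocations a std::vector<int> makes while being grown
    // to `target` elements by push_back, assuming capacity doubles whenever the
    // vector is full (an assumption; real growth factors vary by implementation).
    int main() {
        std::size_t target = 10'000, cap = 0, size = 0, requests = 0;
        while (size < target) {
            if (size == cap) {           // vector is full: reallocate
                cap = cap ? cap * 2 : 1; // assumed doubling policy
                ++requests;
                std::cout << "request " << requests << ": " << cap
                          << " contiguous int-sized chunks\n";
            }
            ++size;                      // one push_back
        }
        std::cout << requests << " reallocations per vector; the free-list scan "
                     "runs for each of these requests\n";
    }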

    SHOCK: Alternative Implementation?

    During my investigations I found a number of defines that can be used to get more insight, e.g.:

    #define NDEBUG
    //#define BOOST_POOL_INSTRUMENT 1
    //#define BOOST_POOL_VALIDATE 1
    //#define BOOST_POOL_VALGRIND 1
    

    I tried VALIDATE/INSTRUMENT and observed the expected effects (very verbose output and a slight performance degradation).

    From reading the name/code I would have expected BOOST_POOL_VALGRIND to similarly degrade performance (after all, it should probably do extra work to avoid false positive memory errors when running Valgrind, right?). Much to my surprise, defining it makes the whole thing run lightning fast: Live On Compiler Explorer

    run 1 naive:      8.166ms reserved:   5.267ms(std::allocator<int>)
    run 1 naive:      9.713ms reserved:   5.267ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
    run 1 naive:      8.853ms reserved:   5.226ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
    run 1 naive:      8.990ms reserved:   5.282ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
    run 1 naive:      8.899ms reserved:   5.246ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
    run 1 naive:      8.620ms reserved:   5.237ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
    run 1 naive:      8.622ms reserved:   5.247ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
    run 1 naive:      8.963ms reserved:   5.257ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
    run 1 naive:      8.990ms reserved:   5.271ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
     -- Done (127.276ms)
    run 2 naive:      7.965ms reserved:   5.208ms(std::allocator<int>)
    run 2 naive:      8.503ms reserved:   5.236ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
    run 2 naive:      8.809ms reserved:   5.254ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
    run 2 naive:      8.954ms reserved:   5.278ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
    run 2 naive:      8.878ms reserved:   5.279ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
    run 2 naive:      8.694ms reserved:   5.243ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
    run 2 naive:      8.661ms reserved:   5.249ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
    run 2 naive:      8.920ms reserved:   5.248ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
    run 2 naive:      8.952ms reserved:   5.261ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
     -- Done (125.680ms)
    run 3 naive:      7.949ms reserved:   5.221ms(std::allocator<int>)
    run 3 naive:      8.498ms reserved:   5.238ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
    run 3 naive:      8.813ms reserved:   5.230ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
    run 3 naive:      9.033ms reserved:   5.279ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
    run 3 naive:      8.909ms reserved:   5.252ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
    run 3 naive:      8.605ms reserved:   5.244ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
    run 3 naive:      8.623ms reserved:   5.246ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
    run 3 naive:      8.918ms reserved:   5.247ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
    run 3 naive:      8.969ms reserved:   5.268ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
    

    Sadly, looking at the details confirms that it is cheating by delegating to the standard library directly (while adding some overhead with sets of free_list/used_list addresses).

    [profiler screenshot: BOOST_POOL_VALGRIND path delegating to the standard allocator]
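
    For what it's worth, a simplified model of what that code path amounts to (my own sketch based on the profile, not the actual Boost source) is: forward every allocation straight to the underlying allocator and merely record the addresses in a set, which is why it inherits the speed of the system allocator:

    #include <cstdlib>
    #include <set>

    // Simplified model of the BOOST_POOL_VALGRIND code path: no segregated
    // storage at all, just pass-through allocation plus address bookkeeping.
    class pass_through_pool {
        std::set<void*> used_list_; // addresses currently handed out
    public:
        void* malloc_one(std::size_t size) {
            void* p = std::malloc(size); // delegate to the system allocator
            if (p) used_list_.insert(p); // bookkeeping only
            return p;
        }
        void free_one(void* p) {
            used_list_.erase(p);
            std::free(p);
        }
    };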

    Summary

    Yes, the standard pool/simple_segregated_storage implementation performs badly under this load. Whether or not this is actually a bug I cannot say with certainty, but it certainly looks like one, going by the documentation (which you quoted as well).