Tags: c++, performance, vector, allocator

Will a custom allocator improve performance if many vector<T>s get constructed and destroyed?


In the code below, with 60% probability a new vector of 10 ints is constructed, or with 40% probability an existing vector is deleted. This results in many calls to new/malloc and delete. Since all these vectors are of type vector<int>, can a custom allocator help to reduce the calls to new and delete and thus increase performance? The idea is that the space of a deleted vector can be reused by a newly constructed one. What would such an allocator look like?

Note: This question is about allocators that reduce calls to new and delete.

#include <iostream>
#include <vector>
#include <random>

using namespace std;

int main() 
{
    // Random generator and distribution
    mt19937 gen(123456);
    uniform_real_distribution<> dis01(0., 1.);

    // Make or delete 10E6 vectors.
    vector< vector<int> > v; //the inner vectors will make many calls to new and delete

    v.reserve(10E5); //assume some size.

    for(int i=0; i<10E6; ++i)
    {
        if(dis01(gen)<0.6) // if true: make new sub-vector
        {
            v.emplace_back(); //new sub-vector
            v.back().reserve(10);

            for(int k=0; k<10; ++k)
                v.back().emplace_back(k); //fill it with some numbers
        }
        else // else, delete the last entry if there is one.
            if(!v.empty())
                v.pop_back();
    }

    cout<<"v.size()= "<<v.size();       
    return 0;
}

Solution

  • This is for C++11. Older standards require additional members to be implemented in the allocator [1].

    This is proof-of-concept code. It runs and solves the example problem, but it has several limitations. Nevertheless, it demonstrates how a custom allocator can be used to improve performance in a scenario where many std::vectors are created and destroyed.

    PoolAlloc.hh:

    #include <cstddef>
    #include <type_traits>
    #include <vector>
    
    // One contiguous buffer that the pool hands memory out of.
    template<typename T>
    struct MemChunk
    {
        std::size_t buf_size=0; // capacity of buf, in objects of type T
        T* buf=nullptr;         // start of the buffer
        T* top=nullptr;         // first position not yet handed out
        std::size_t used=0;     // number of objects handed out from this chunk
    };
    
    template<typename T>
    class PoolAllocator
    {
        public:
        using value_type=T;
    
        PoolAllocator();
        explicit PoolAllocator(std::size_t);
        PoolAllocator(PoolAllocator const&) noexcept;
        template<typename U>
            PoolAllocator(PoolAllocator<U> const&) noexcept;
        PoolAllocator(PoolAllocator&&) noexcept;
        PoolAllocator& operator=(PoolAllocator const&)=delete;
        PoolAllocator& operator=(PoolAllocator&&)=delete;
        ~PoolAllocator();
    
        template <typename U> 
        struct rebind 
        {
            using other=PoolAllocator<U>;
        };
    
        T* allocate(std::size_t);
        void deallocate(T*, std::size_t) noexcept;
    
        template<typename U1, typename U2>
            friend bool operator==(PoolAllocator<U1> const&, PoolAllocator<U2> const&) noexcept;
    
        private:
        std::vector<MemChunk<T>>* memory_=nullptr;
        int* ref_count_=nullptr;
        std::size_t default_buf_size_=0;
    };
    
    template<typename T>
    PoolAllocator<T>::PoolAllocator():
        PoolAllocator{100000} {}
    
    template<typename T>
    PoolAllocator<T>::PoolAllocator(std::size_t buf_size):
        memory_{new std::vector<MemChunk<T>>},
        ref_count_{new int(0)},
        default_buf_size_{buf_size}
    {
        memory_->emplace_back();
        memory_->back().buf_size=buf_size;
        memory_->back().buf=new T[buf_size];
        memory_->back().top=memory_->back().buf;
        ++(*ref_count_);
    }
    
    template<typename T>
    PoolAllocator<T>::PoolAllocator(PoolAllocator const& src) noexcept:
        memory_{src.memory_},
        ref_count_{src.ref_count_},
        default_buf_size_{src.default_buf_size_}
    {
        ++(*ref_count_);
    }
    
    template<typename T>
    PoolAllocator<T>::PoolAllocator(PoolAllocator&& src) noexcept:
        memory_{src.memory_},
        ref_count_{src.ref_count_},
        default_buf_size_{src.default_buf_size_}
    {
        src.memory_=nullptr;
        src.ref_count_=nullptr;
    }
    
    // Note: this converting constructor is only well-formed for U==T; rebinding
    // this proof-of-concept allocator to a different value type is not supported.
    template<typename T>
    template<typename U>
    PoolAllocator<T>::PoolAllocator(PoolAllocator<U> const& src) noexcept:
        memory_{src.memory_},
        ref_count_{src.ref_count_},
        default_buf_size_{src.default_buf_size_}
    {
        ++(*ref_count_);
    }
    
    template<typename T>
    PoolAllocator<T>::~PoolAllocator()
    {
        if (ref_count_!=nullptr)
        {
            --(*ref_count_);
            if (*ref_count_==0)
            {
                if (memory_!=nullptr)
                {
                    for (auto& it : *memory_)
                    {
                        delete[] it.buf;
                    }
                    delete memory_;
                }
                delete ref_count_;
            }
        }
    }
    
    template<typename T>
    T*
    PoolAllocator<T>::allocate(std::size_t n)
    {
        // Try to carve n objects out of the most recently created chunk.
        MemChunk<T>* mem_chunk=&memory_->back();
        if ((mem_chunk->used+n)>mem_chunk->buf_size)
        {
            // Not enough room left: open a new, larger chunk. Old chunks are never
            // reallocated, so pointers handed out earlier remain valid.
            default_buf_size_*=2;
            memory_->emplace_back();
            mem_chunk=&memory_->back();
            std::size_t buf_size=default_buf_size_;
            if (n>default_buf_size_)
            {
                buf_size=n;
            }
            mem_chunk->buf_size=buf_size;
            mem_chunk->buf=new T[mem_chunk->buf_size]; //note: new[] also runs T's constructors; fine for int
            mem_chunk->top=mem_chunk->buf;
        }
        T* r=mem_chunk->top;
        mem_chunk->top+=n;
        mem_chunk->used+=n;
        return r;
    }
    
    template<typename T>
    void
    PoolAllocator<T>::deallocate(T* addr, std::size_t n) noexcept
    {
        // Memory is only reclaimed when it sits at the very end of the most recent
        // chunk; anything else is simply left unused (see the notes below).
        MemChunk<T>* mem_chunk=&memory_->back();
        if (mem_chunk->used>n and (mem_chunk->top-n)==addr)
        {
            mem_chunk->used-=n;
            mem_chunk->top-=n;
        }
    }
    
    template<typename U1, typename U2>
    bool operator==(PoolAllocator<U1> const& lhs, PoolAllocator<U2> const& rhs) noexcept
    {
        return (std::is_same<U1, U2>::value and lhs.memory_==rhs.memory_);
    }
    
    // Pre-C++20 allocators must also provide operator!=.
    template<typename U1, typename U2>
    bool operator!=(PoolAllocator<U1> const& lhs, PoolAllocator<U2> const& rhs) noexcept
    {
        return !(lhs==rhs);
    }
    

    Using your example modified in the following way:

    #include <iostream>
    #include <vector>
    #include <random>   
    #include "PoolAlloc.hh"
    
    using namespace std;
    
    int main() 
    {
        // Random generator and distribution
        mt19937 gen(123456);
        uniform_real_distribution<> dis01(0., 1.);
        PoolAllocator<int> palloc{1000000};
    
        // Make or delete 10E6 vectors.
        vector< vector<int, PoolAllocator<int>> > v; //the inner vectors will make many calls to new and delete
    
        v.reserve(10E5); //assume some size.
    
        for(int i=0; i<10E6; ++i)
        {
            if(dis01(gen)<0.6) // if true: make new sub-vector
            {
                v.emplace_back(palloc); //new sub-vector
                v.back().reserve(10);
    
                for(int k=0; k<10; ++k)
                    v.back().emplace_back(k); //fill it with some numbers
            }
            else // else, delete the last entry if there is one.
                if(!v.empty())
                    v.pop_back();
        }
    
        cout<<"v.size()= "<<v.size();   
        return 0;
    }
    

    The number of calls to malloc drops from ~6e6 down to 21, and the total number of executed instructions drops from 3.7e9 to 2.5e9 (compiled with -O3, measured with valgrind --tool=callgrind).
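
    If you want a rough allocation count without valgrind, one alternative (a sketch only; this is not how the numbers above were obtained) is to replace the global operator new/operator delete with counting versions and print the counter at the end of main(). Self-contained example:

    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <new>
    #include <vector>
    
    static std::size_t g_new_calls=0; // incremented on every allocation (not thread-safe)
    
    void* operator new(std::size_t n)
    {
        ++g_new_calls;
        if (void* p=std::malloc(n))
            return p;
        throw std::bad_alloc{};
    }
    
    void operator delete(void* p) noexcept
    {
        std::free(p);
    }
    
    int main()
    {
        std::vector<int> v(10); // one heap allocation, routed through the counting operator new
        std::printf("operator new calls: %zu\n", g_new_calls);
        return 0;
    }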

    There are a couple of implementation details that will affect performance depending on how the allocator is used.

    Currently, multiple buffers are used: if one is full, another one is created. This way an existing buffer never has to be reallocated, which would invalidate every pointer the allocator has already handed out; the sketch below illustrates the problem.
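
    To see why a reallocating buffer would be a problem, consider the following sketch (hypothetical code, not part of PoolAllocator): any pointer already handed out by allocate() dangles as soon as the buffer grows and its storage moves.

    #include <vector>
    
    int main()
    {
        // Hypothetical single-buffer pool: all allocations are carved out of one vector.
        std::vector<int> pool(1000);
        int* handed_out=pool.data(); // pointer previously returned by allocate()
    
        // Growing the pool may reallocate and move the underlying storage ...
        pool.resize(100000);
    
        // ... leaving 'handed_out' dangling: any container still using that memory
        // would now access freed storage, which is undefined behaviour.
        (void)handed_out;
        return 0;
    }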

    The biggest question is how to deal with deallocated memory. Currently a trivial approach is used: deallocated memory is only made available to later allocations when it sits at the end of the most recent buffer. For your example that is sufficient, as you only ever deallocate memory at the end of the buffer.

    For more complex scenarios you will need a more sophisticated mechanism: some data structure has to record the addresses and sizes of freed memory blocks so that later allocations can reuse them (a sketch of one option follows below). Several designs are possible, and their performance will vary with the concrete usage pattern; I doubt there is a good one-size-fits-all solution here.
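
    One possible direction is a size-bucketed free list: deallocate() records the returned block under its size, and allocate() checks that record before carving new memory out of a chunk. A minimal sketch of such a bookkeeping structure (illustrative names only, not a drop-in extension of PoolAlloc.hh):

    #include <cstddef>
    #include <unordered_map>
    #include <vector>
    
    // Maps block size -> addresses of freed blocks of exactly that size.
    template<typename T>
    class FreeList
    {
        public:
        // Remember a block of n objects starting at addr for later reuse.
        void put(T* addr, std::size_t n)
        {
            free_[n].push_back(addr);
        }
    
        // Hand back a previously freed block of exactly n objects, or nullptr if none is available.
        T* take(std::size_t n)
        {
            auto it=free_.find(n);
            if (it==free_.end() or it->second.empty())
                return nullptr;
            T* addr=it->second.back();
            it->second.pop_back();
            return addr;
        }
    
        private:
        std::unordered_map<std::size_t, std::vector<T*>> free_;
    };

    Exact-size buckets fit your example well, because every inner vector reserves exactly 10 ints; a general-purpose pool would additionally have to split and coalesce blocks, which is where the designs start to differ.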

    [1] http://howardhinnant.github.io/stack_alloc.html