Tags: c++, performance, shared-ptr

Is it a good idea to cache a raw pointer along with its owning shared_ptr for better access performance?


Consider this scenario:

class A
{
    std::shared_ptr<B> _b; // owns the object
    B* _raw;               // cached raw pointer

public:
    explicit A(std::shared_ptr<B> b)
        : _b(std::move(b)),
          _raw(_b.get())
    { }

    void foo()
    {
         // Use _raw instead of _b to
         // avoid one extra indirection / memory jump
         // and also avoid polluting the cache
    }
};

I know that technically it works and appears to offer a slight performance advantage (I tried it). (EDIT: a false conclusion, see below.) But my question is: is it conceptually wrong? Is it bad practice, and why? And if not, why isn't this hack more commonly used?

Here is a minimal reproducible example comparing raw pointer access to shared_ptr access:

#include <chrono>
#include <iostream>
#include <memory>

struct timer final
{
    timer()
        : start{std::chrono::steady_clock::now()} // steady_clock: monotonic, suited to intervals
    { }

    void elapsed() const
    {
        auto now = std::chrono::steady_clock::now();
        std::cout << std::chrono::duration<double>(now - start).count() << " seconds" << std::endl;
    }

private:
    std::chrono::time_point<std::chrono::steady_clock> start;
};

struct A
{
    size_t a[2097152];
};

int main()
{
    size_t data_size = 2097152;
    size_t count = 10000000000;

    // Using Raw pointer
    A * pa = new A();
    timer t0;
    for(size_t i = 0; i < count; i++)
        pa->a[i % data_size] = i;
    t0.elapsed();

    // Using shared_ptr
    std::shared_ptr<A> sa = std::make_shared<A>();
    timer t1;
    for(size_t i = 0; i < count; i++)
        sa->a[i % data_size] = i;
    t1.elapsed();
}

Output:

3.98586 seconds

4.10491 seconds

I ran this multiple times and the results are consistent.

EDIT: As per the consensus in the answers, the above experiment is invalid. Compilers are way smarter than they appear.


Solution

  • This answer shows that your test is invalid (correct performance measurement in C++ is quite hard, since there are lots of pitfalls), and that as a result you have come to invalid conclusions.

    Take a look at this godbolt.

    The for loop for the first version:

    .L39:
            mov     rdx, rax
            and     edx, 2097151
            mov     QWORD PTR [rbp+0+rdx*8], rax
            add     rax, 1
            cmp     rax, rcx
            jne     .L39
    

    The for loop for the second version:

    .L40:
            mov     rdx, rax
            and     edx, 2097151
            mov     QWORD PTR [rbp+16+rdx*8], rax
            add     rax, 1
            cmp     rax, rcx
            jne     .L40
    

    I do not see a difference! The results should be exactly the same.

    So I suspect that you did your measurements with a Debug build.

    Here is a version where you can compare this.

    What is more interesting, clang is able to optimize the for loop away entirely when the shared pointer is not used: it notices that the loop's results are never observed and simply removes it. So if you measured a Release build, the compiler just outsmarted you.

    Bottom line:

    • shared_ptr access does not add overhead
    • when checking performance you must compile with optimizations enabled
    • you must also verify that the test code was not optimized away, to be sure the results are valid.

    Here is a proper test written using Google Benchmark, and the results for both cases are exactly the same.