Tags: c++, performance, shared-ptr

Is it a good idea to cache a raw pointer along with its owning shared_ptr for better access performance?


Consider this scenario:

class A
{
    std::shared_ptr<B> _b; // owns the object
    B* _raw;               // cached raw pointer

public:
    explicit A(std::shared_ptr<B> b)
        : _b(std::move(b)),
          _raw(_b.get())
    { }

    void foo()
    {
         // Use _raw instead of _b to
         // avoid one extra indirection / memory jump
         // and also avoid polluting the cache
    }
};

I know that technically it works and appears to offer a slight performance advantage (I tried it). (EDIT: a false conclusion, see below.) But my question is: is it conceptually wrong? Is it bad practice, and why? And if not, why isn't this hack more commonly used?

Here is a minimal reproducible example comparing raw pointer access to shared_ptr access:

#include <chrono>
#include <iostream>
#include <memory>

struct timer final
{
    timer()
        : start{std::chrono::steady_clock::now()} // steady_clock: monotonic, suited to intervals
    { }

    void elapsed() const
    {
        auto now = std::chrono::steady_clock::now();
        std::cout << std::chrono::duration<double>(now - start).count() << " seconds" << std::endl;
    }

private:
    std::chrono::time_point<std::chrono::steady_clock> start;
};

struct A
{
    size_t a[2097152];
};

int main()
{
    size_t data_size = 2097152;
    size_t count = 10000000000;

    // Using Raw pointer
    A * pa = new A();
    timer t0;
    for(size_t i = 0; i < count; i++)
        pa->a[i % data_size] = i;
    t0.elapsed();

    // Using shared_ptr
    std::shared_ptr<A> sa = std::make_shared<A>();
    timer t1;
    for(size_t i = 0; i < count; i++)
        sa->a[i % data_size] = i;
    t1.elapsed();
}

Output:

3.98586 seconds

4.10491 seconds

I ran this multiple times and the results are consistent.

EDIT: As per the consensus in the answers, the above experiment is invalid. Compilers are way smarter than they appear.


Solution

  • This answer shows that your test is invalid (correct performance measurement in C++ is quite hard, since there are lots of pitfalls), and that as a result you have come to invalid conclusions.

    Take a look at this godbolt.

    The for loop for the first version:

    .L39:
            mov     rdx, rax
            and     edx, 2097151
            mov     QWORD PTR [rbp+0+rdx*8], rax
            add     rax, 1
            cmp     rax, rcx
            jne     .L39
    

    The for loop for the second version:

    .L40:
            mov     rdx, rax
            and     edx, 2097151
            mov     QWORD PTR [rbp+16+rdx*8], rax
            add     rax, 1
            cmp     rax, rcx
            jne     .L40
    

    I do not see a difference! The results should be exactly the same.

    So I suspect that you did your measurements with a Debug build.

    Here is a version where you can compare this.

    What is more interesting, clang is able to optimize the for loop away entirely when the shared pointer is not used: it notices that the loop's results are never observed and simply removes it. So if you measured a Release build, the compiler just outsmarted you.

    Bottom line:

    • shared_ptr access does not add overhead
    • when checking performance you must compile with optimizations enabled
    • you must also verify that the test code was not optimized away, to be sure the results are valid.

    Here is a proper test written using Google Benchmark, and the results for both cases are exactly the same.