
Internal structure of a std::string object


I'm trying to understand the internal structure of std::string using GDB, and I want to check whether I've understood it correctly.

I have a std::string object that contains the string AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA (32 'A' characters).

When I inspect it in GDB, I see this header: 0x00000020 0x00000020 0x00000000

data: 0x41414141 0x41414141 0x41414141 0x41414141 0x41414141 0x41414141 0x41414141 0x41414141

When the object is destroyed by std::string::~string, the data is unchanged, but the header becomes:

0x00000000 0x00000020 0xffffffff

Is this right? Is 0x20 the size of the string (and why do I see it twice?), and when the std::string object is destroyed, is 0x00000000 replaced with 0xffffffff?

I don't fully understand this. Can someone please explain?


Solution

  • I look at the internal structure of library types, especially container objects, to understand more about how a compiler performs its magic. My standard approach is to copy the object into a std::array of the same size, then print the array in hex. This is useful for exploring exactly what happens when a container object is "moved", as well as for learning how different library authors implemented the container.

    Here's the basic code adapted to std::string. It examines how the object changes between an empty string, a string with the maximum SSO (small string optimization) contents, and a string long enough that its characters must be stored on the heap.

    The SSO check relies on the fact that, for an SSO string, a pointer to any character in the string lies between the start and end of the string object itself.

    #include <string>
    #include <array>
    #include <iostream>
    #include <iomanip>
    #include <cstdint>
    #include <cstring>
    
    void print_string_object(const std::string& s)
    {
        // Check that the size of a string object is a multiple of a pointer size
        static_assert(sizeof(std::string) % sizeof(uintptr_t) == 0, "string size must be a multiple of pointer size");
    
        // Copy the raw bytes of the string object into an array of uintptr_t of the same size
        // (std::memcpy avoids the aliasing problems of casting the object pointer)
        using s_obj = std::array<uintptr_t, sizeof(std::string) / sizeof(uintptr_t)>;
        s_obj s_ptrs;
        std::memcpy(s_ptrs.data(), &s, sizeof(s));
    
        // Print details of string object in hex
        std::cout << "Address of Object\n  " << std::setfill('0') << std::setw(2*sizeof(uintptr_t)) << std::hex << &s << "\nObject\n";
        for (auto x : s_ptrs)
            std::cout << "  " << std::setfill('0') <<  std::setw(2 * sizeof(uintptr_t)) << std::hex << x << '\n';
    }
    
    int max_SSO(std::string &s)
    {
        // Return the maximum string length that fits inside a string object (SSO)
        // and fill s with bytes 0 1 2 3 ... until the SSO buffer is maxed out
        std::string s0;
        uintptr_t base = reinterpret_cast<uintptr_t>(&s0);
        uintptr_t top = reinterpret_cast<uintptr_t>(&s0) + sizeof(s0);
        for (int i = 0;; i++)
        {
            s0 += static_cast<char>(i);
            if (reinterpret_cast<uintptr_t>(&s0[0]) < base || reinterpret_cast<uintptr_t>(&s0[0]) >= top)
                return i;
            s += static_cast<char>(i);
        }
    }
    
    int main()
    {
        std::string s;
        std::cout << "Capacity of empty string=" << s.capacity() << '\n';
        std::cout << "Empty string\n";
        print_string_object(s); // print details of empty string
        std::cout << "\nFull SSO string length=" << std::dec <<  max_SSO(s) << "\n";
        print_string_object(s); // print details of max SSO string 
        s += "0";
        std::cout << "\nDynamic memory string\n";
        print_string_object(s); // print details of dynamic allocated string 
    }
    

    And here's a link to compiler explorer for clang and gcc

    MSVC output x64 is:

    Capacity of empty string=15
    Empty string
    Address of Object
      000000AF535AF840
    Object
      0000000000000000
      0000000000000000
      0000000000000000
      000000000000000f
    
    Full SSO string length=15
    Address of Object
      000000AF535AF840
    Object
      0706050403020100
      000e0d0c0b0a0908
      000000000000000f
      000000000000000f
    
    Dynamic memory string
    Address of Object
      000000AF535AF840
    Object
      00000235a757e8a0
      000e0d0c0b0a0908
      0000000000000010
      000000000000001f
    

    For MSVC, the first 16 bytes are used to store the SSO chars. This allows a string length of 15 plus the required terminating null char. When dynamic memory is required for longer strings, the first 8 bytes hold a pointer to the chars stored on the heap. The last two entries are the current string size and the capacity, that is, the maximum size the string can reach before a new memory allocation is needed. GCC and Clang have somewhat different layouts. Clang (with libc++), in particular, allows SSO strings of up to 22 chars, and its object size is 8 bytes smaller! Very efficient.

    I've found the approach very useful for quickly understanding what is actually going on in library container code.