std::string implementation in GCC and its memory overhead for short strings

I am currently working on an application for a low-memory platform that requires an std::set of many short strings (>100,000 strings of 4-16 characters each). I recently transitioned this set from std::string to const char * to save memory and I was wondering whether I was really avoiding all that much overhead per string.

I tried using the following:

std::string sizeTest = "testString";
std::cout << sizeof(sizeTest) << " bytes";

But it just gave me an output of 4 bytes, indicating that the string contains a pointer. I'm well aware that strings store their data in a char * internally, but I thought the string class would have additional overhead.

Does the GCC implementation of std::string incur more overhead than sizeof(std::string) would indicate? More importantly, is it significant over this size of data set?

Here are the sizes of relevant types on my platform (it is 32-bit and has 8 bits per byte):

char: 1 bytes
void *: 4 bytes
char *: 4 bytes
std::string: 4 bytes

Solution

Well, at least with GCC 4.4.5, which is what I have handy on this machine, std::string is a typdef for std::basic_string<char>, and basic_string is defined in /usr/include/c++/4.4.5/bits/basic_string.h. There's a lot of indirection in that file, but what it comes down to is that nonempty std::strings store a pointer to one of these:

  struct _Rep_base
  {
size_type       _M_length;
size_type       _M_capacity;
_Atomic_word        _M_refcount;
  };

Followed in-memory by the actual string data. So std::string is going to have at least three words of overhead for each string, plus any overhead for having a higher capacity than `length (probably not, depending on how you construct your strings -- you can check by asking the capacity() method).

There's also going to be overhead from your memory allocator for doing lots of small allocations; I don't know what GCC uses for C++, but assuming it's similar to the dlmalloc allocator it uses for C, that could be at least two words per allocation, plus some space to align the size to a multiple of at least 8 bytes.