I am currently working on an application for a low-memory platform that requires an std::set of many short strings (>100,000 strings of 4-16 characters each). I recently transitioned this set from std::string to const char * to save memory and I was wondering whether I was really avoiding all that much overhead per string.
I tried using the following:
std::string sizeTest = "testString";
std::cout << sizeof(sizeTest) << " bytes";
But it just gave me an output of 4 bytes, indicating that the string contains a pointer. I'm well aware that strings store their data in a char * internally, but I thought the string class would have additional overhead.
Does the GCC implementation of std::string incur more overhead than sizeof(std::string) would indicate? More importantly, is it significant over this size of data set?
Here are the sizes of relevant types on my platform (it is 32-bit and has 8 bits per byte):
char: 1 bytes
void *: 4 bytes
char *: 4 bytes
std::string: 4 bytes
Well, at least with GCC 4.4.5, which is what I have handy on this
machine, std::string
is a typdef for std::basic_string<char>
, and
basic_string
is defined in
/usr/include/c++/4.4.5/bits/basic_string.h
. There's a lot of
indirection in that file, but what it comes down to is that nonempty
std::string
s store a pointer to one of these:
struct _Rep_base
{
size_type _M_length;
size_type _M_capacity;
_Atomic_word _M_refcount;
};
Followed in-memory by the actual string data. So std::string
is
going to have at least three words of overhead for each string, plus
any overhead for having a higher capacity
than `length
(probably
not, depending on how you construct your strings -- you can check by
asking the capacity()
method).
There's also going to be overhead from your memory allocator for doing
lots of small allocations; I don't know what GCC uses for C++, but
assuming it's similar to the dlmalloc
allocator it uses for C, that
could be at least two words per allocation, plus some space to align
the size to a multiple of at least 8 bytes.