Tags: string, rust, heap-memory

Why can a `str` type be of any size (unknown size) while a `String` type size is supposedly known?


I was learning Rust with a book and the following excerpt threw me off a bit:

Also note that &str has the & in front of it because you need a reference to use a str. That's because of the reason we saw above: the stack needs to know the size, and a str can be of any length. So we access it with a &, a reference. The compiler knows the size of a reference's pointer, and it can then use the & to find where the str data is and read it. Also, because you use a & to interact with a str, you don't own it. But a String is an "owned" type.

I understand that for variables of unknown size, you must place the data on the heap and then refer to it with a fixed-length pointer on the stack. My confusion lies with the statement that str can be of any length.

Why can't a String also be of unknown length at times, and so require the same approach of referring to data on the heap?

I understand that the book will probably dive deeper into the details later on, but I was wondering if someone could already provide some additional context for me, specifically regarding the question above? Any useful accompanying details regarding the &str and String types in Rust, that are good to know for a beginner to the language, are highly appreciated as well.


Solution

  • Like a slice [T], str is a variably-sized type (what the Rust documentation calls a dynamically sized type, or DST). In fact, str is essentially a [u8] that is guaranteed to contain valid UTF-8.
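
    The relationship between str and [u8] can be seen with a minimal sketch like the one below. The helper name view_as_str is made up for illustration; as_bytes and std::str::from_utf8 are standard library functions.

```rust
use std::str;

/// Hypothetical helper: views `bytes` as a `str`, succeeding only for valid UTF-8.
fn view_as_str(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

fn main() {
    // "h\u{e9}llo" is "hello" with an e-acute as its second character.
    let s: &str = "h\u{e9}llo";

    // A str is stored as bytes; the accented letter takes two bytes in UTF-8.
    let bytes: &[u8] = s.as_bytes();
    println!("{} chars, {} bytes", s.chars().count(), bytes.len());

    // Valid UTF-8 bytes can be viewed as a str without copying.
    println!("{:?}", view_as_str(bytes));

    // 0xFF never occurs in valid UTF-8, so these bytes are rejected.
    println!("{:?}", view_as_str(&[0x68, 0xFF]));
}
```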

    Variably-sized types are special. They do not implement the Sized trait, so they cannot be stored directly in a local variable. A reference to a variably-sized type is "fat": it doesn't just hold the address of the referenced thing, but also its size (for a str, its length in bytes).

    str therefore means "some area in memory which contains valid UTF-8 data". And &str is "the address and size of such an area".
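
    A small sketch makes the "fat" reference visible by comparing reference sizes. The function name reference_sizes is an assumption for this example, and the two-word size of &str is what current compilers produce in practice rather than something the sketch proves.

```rust
use std::mem::size_of;

/// Hypothetical helper: returns the sizes in bytes of a thin and a fat reference.
fn reference_sizes() -> (usize, usize) {
    (size_of::<&u8>(), size_of::<&str>())
}

fn main() {
    let (thin, fat) = reference_sizes();
    // &u8 is just an address; &str is an address plus a length.
    println!("&u8: {} bytes, &str: {} bytes", thin, fat);

    let r: &str = "hello";
    // as_ptr() and len() expose the two halves of the fat reference.
    println!("address = {:p}, length = {} bytes", r.as_ptr(), r.len());
}
```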


    String, on the other hand, is a struct with a fixed size. One of its members is a pointer to string data stored somewhere else (on the heap). Conceptually, a String contains a &str along with the unused capacity of the memory area. (In reality, a String is a wrapper around a Vec<u8> with a UTF-8 guarantee. A Vec<u8> conceptually contains a &[u8] plus a capacity, but is really a raw pointer, a length and a capacity.)

    The total memory required by a String is therefore still variable, but the size of the String struct itself is known at compile time.
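
    As a rough illustration, the sketch below measures the String struct itself for a short and a long string. The helper handle_size is invented for this example; on current implementations the result is typically three machine words, but the code only checks that it does not depend on the contents.

```rust
use std::mem::{size_of, size_of_val};

/// Hypothetical helper: size in bytes of the `String` struct itself,
/// not counting the heap buffer it points to.
fn handle_size(s: &String) -> usize {
    size_of_val(s)
}

fn main() {
    let short = String::from("hi");
    let long = "x".repeat(10_000);

    // The struct has the same size no matter how much text it manages.
    println!("short: {} bytes, long: {} bytes", handle_size(&short), handle_size(&long));
    println!("size_of::<String>() = {}", size_of::<String>());

    // The variable part lives in the heap buffer the struct points to.
    println!("len = {}, capacity = {}", long.len(), long.capacity());

    // Borrowing the contents yields a &str (address + length).
    let view: &str = long.as_str();
    println!("view.len() = {}", view.len());
}
```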

    Why is it this way? Because the entire point of String is to manage a memory region containing string data, and it can't do that if it is the memory region containing string data.


    An aside:

    "I understand that for variables of unknown size, you must place the data on the heap"

    This is a misconception. The heap is the most obvious place to put variably-sized data, but it is not the only one:

    • string literals are placed in read-only memory,
    • you can have a fixed-size buffer somewhere (global variable, local stack array) and put some variably-sized data inside as long as it fits,
    • at a low level, you can use an alloca equivalent to allocate variably-sized data on the stack.
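
    The first two cases can be sketched as follows. The function copy_into_buffer and the 16-byte buffer size are assumptions chosen for illustration; the alloca case is left out of the sketch.

```rust
use std::str;

/// Hypothetical helper: copies `text` into a fixed-size array and returns
/// the array together with the number of bytes used, or None if it does not fit.
fn copy_into_buffer(text: &str) -> Option<([u8; 16], usize)> {
    let mut buf = [0u8; 16];
    let bytes = text.as_bytes();
    if bytes.len() > buf.len() {
        return None;
    }
    buf[..bytes.len()].copy_from_slice(bytes);
    Some((buf, bytes.len()))
}

fn main() {
    // A string literal: its bytes are part of the compiled program, not heap-allocated.
    let literal: &'static str = "hello";

    // A local array on the stack that holds variably-sized text, as long as it fits.
    let (buf, used) = copy_into_buffer(literal).expect("text fits in 16 bytes");
    let on_stack: &str = str::from_utf8(&buf[..used]).expect("copied bytes are valid UTF-8");
    println!("{} ({} of {} bytes used)", on_stack, used, buf.len());
}
```

    In both cases the &str is still an address plus a length; only the location of the bytes it points to differs.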