Tags: assembly, unicode, utf-8, x86, nasm

Does UTF-8 use same amount of memory as UTF-32 when pushed on the stack?


The question is specifically about how much space UTF-8 occupies on the stack and therefore in memory (RAM), as in: is it the same as UTF-32? So this is not about how much disk space UTF-8 takes when serialized to a file. Sorry if that attempt at disambiguation insulted your intellect.

  • The stack is always in RAM, so anything I push onto it occupies space in RAM.

https://stackoverflow.com/questions/15433390/is-stack-in-cpu-or-ram#:~:text=Stack%20is%20always%20in%20RAM,at%20the%20top%20of%20stack.

  • A push is at least 32 bits on x86 and 64 bits on x86_64. So whether I push one-byte chars or three-byte chars onto the stack, each one takes at least 32 bits in memory. I imagine this is what happens with UTF-32: each character takes 32 bits on the stack.

How many bytes does the push instruction push onto the stack when I don't specify the operand size?

So, what do they mean when they say UTF-32 takes more memory than UTF-8?

Edit

UTF-32 uses more memory, but today's computers are equipped with a lot of memory. The pressure to save memory is gone and the simple and fast handling of UTF-32 strings outweighs the increased memory usage. Using UTF-32 leads to faster programs than any approach that tries to save memory by examining strings.

https://seed7.sourceforge.net/faq.htm#unicode


Solution

  • In the weird case where you push multiple separate UTF-8 code units (bytes), yes, that would use 8 bytes of stack space per byte of UTF-8 data (a normal push in 64-bit code writes 8 bytes). But only in that case.

    That's horribly inefficient, which is why people don't write code that way (except for some simplistic beginner examples of using the stack to reverse a short string, as a learning exercise for understanding the LIFO ordering of push/pop).
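    As a concrete (and deliberately wasteful) version of that exercise, here's a minimal NASM x86-64 sketch; the register assignments and label names are just assumptions for illustration. Every 1-byte code unit costs a full 8-byte push, so a 5-byte string briefly occupies 40 bytes of stack:

        ; reverse a short byte string through the stack -- learning exercise only.
        ; rsi = source, rdi = destination, rcx = length in bytes (assumed >= 1)
        ; note: this reverses bytes, so it also scrambles multi-byte UTF-8 sequences
        reverse_bytes:
            xor     rdx, rdx
        .push_loop:
            movzx   eax, byte [rsi + rdx]
            push    rax                  ; 1 byte of data, 8 bytes of stack space
            inc     rdx
            cmp     rdx, rcx
            jb      .push_loop
            xor     rdx, rdx
        .pop_loop:
            pop     rax
            mov     [rdi + rdx], al      ; LIFO order: bytes come back reversed
            inc     rdx
            cmp     rdx, rcx
            jb      .pop_loop
            ret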

    If you want to store string data in stack space, you'd reserve some space (like a local char array) and use it, not unpacking bytes or dwords to qwords. Like sub rsp, 64+8 / movdqu xmm0, [rsi] / movdqa [rsp], xmm0 to copy 16 bytes (of UTF-32 or UTF-8 data, doesn't matter which).
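    Spelled out as a complete (if trivial) function, a sketch under the usual x86-64 assumption that RSP was 16-byte aligned before the CALL, with rsi as a hypothetical pointer to at least 16 valid bytes of source data:

        ; copy 16 bytes of string data (UTF-8 or UTF-32, doesn't matter which)
        ; into a local stack buffer
        use_local_buffer:
            sub     rsp, 64+8            ; 64-byte buffer; the +8 re-aligns RSP to 16
                                         ; (it was off by 8 right after the CALL)
            movdqu  xmm0, [rsi]          ; unaligned 16-byte load from the source
            movdqa  [rsp], xmm0          ; aligned 16-byte store into the local buffer
            ; ... work on the copy at [rsp] ...
            add     rsp, 64+8            ; release the buffer
            ret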

    If you really want to use push, you could push qword [rdi+rcx] to copy 8 bytes at a time while pushing, counting backwards from the end of the source string so the string ends up on the stack in the same order as the source.
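    A sketch of that loop, assuming rdi points at the source and rsi holds its length, taken here (for simplicity) to be a non-zero multiple of 8:

        ; copy a string onto the stack 8 bytes at a time, back to front,
        ; so the copy at [rsp] ends up in source order
        ; rdi = source, rsi = length in bytes (assumed a non-zero multiple of 8)
        copy_to_stack:
            mov     rcx, rsi
        .push_loop:
            sub     rcx, 8
            push    qword [rdi + rcx]    ; last chunk is pushed first
            jnz     .push_loop           ; PUSH doesn't touch flags; ZF is from the SUB
            ; the copy now starts at [rsp] and is rsi bytes long; use it here
            add     rsp, rsi             ; release it before returning
            ret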

    When you access the data, you can use mov eax, [rsp + rcx*4] for UTF-32 (or preferably a pointer increment, but the scale factor helps illustrate the addressing). Or for UTF-8, movzx eax, byte [rsp + rcx] (with a loop to check for multi-byte characters and potentially load more bytes, if you want to get a unicode code-point into EAX). Unpacking each byte of UTF-8 to 8 bytes makes zero sense, and makes it harder to efficiently handle multi-byte characters. e.g. with an 8-byte load and BMI2 pext to pack, and maybe andn / tzcnt / bzhi to find the end of the multi-byte character (a byte with its high bit clear) and zero the garbage above it.
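    In NASM terms, the straightforward access paths look something like this (a sketch only: rcx and the buffer at [rsp] carry over from the text above, and the UTF-8 side just marks where the multi-byte decode would go):

        ; UTF-32: the n-th code point is one scaled indexed load
            mov     eax, [rsp + rcx*4]       ; rcx = code-point index

        ; UTF-8: load one code unit; any byte >= 0x80 is part of a
        ; multi-byte sequence and needs the extra decoding described above
            movzx   eax, byte [rsp + rcx]    ; rcx = byte offset, not character index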


    For normal ways to handle string data (keeping it packed the same way as on disk), UTF-8 is 4x smaller than UTF-32 for the ASCII subset of Unicode. For Western languages with some 2 and 3-byte accented characters but still mostly 1-byte characters, it's still a lot smaller. For languages where most characters are 3 or more bytes long in UTF-8, UTF-32 doesn't take much more space. (And expanding each byte of UTF-8 to 8 bytes vs. each dword of UTF-32 would make UTF-8 take way more space.)
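    To put numbers on that, a small data-definition sketch (hypothetical labels):

        ascii8:   db  "abcd"                 ; 4 bytes of UTF-8
        ascii32:  dd  'a', 'b', 'c', 'd'     ; 16 bytes of UTF-32: 4x larger
        ; U+00E9 (e with acute accent) is 2 bytes in UTF-8, still 4 in UTF-32
        eacute8:  db  0xC3, 0xA9
        eacute32: dd  0x000000E9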

    It can make sense to convert to UTF-32 on input and back to UTF-8 on output. Then we're back to the good old days where a character has a fixed size so array indexing can give us the nth character (modulo Unicode shenanigans separate from the variable-length encodings like UTF-8 and UTF-16). This does increase space usage, including cache footprint, especially for Western languages. RAM is cheap, but cache footprint and memory bandwidth aren't. So this isn't always the best strategy.