unicode utf-8 decode encode

How is text saved in memory?


Suppose there is a UTF-8-encoded file:

file1.txt

汉字

whose binary representation is:

11100110 10110001 10001001 11100101 10101101 10010111

If I open it with an editor, the editor reads the bit sequence and decodes it. I can see 汉字 in the editor, and 汉字 is held in memory.
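The decoding step described above can be sketched in Python. This is only an illustration of the idea, not what any particular editor does internally; the variable names are made up for the example.

```python
# The six bytes of file1.txt, written as the bit groups shown above.
bit_groups = "11100110 10110001 10001001 11100101 10101101 10010111"
raw = bytes(int(group, 2) for group in bit_groups.split())

# Decoding turns the UTF-8 bytes into abstract characters (code points).
text = raw.decode("utf-8")
code_points = [f"U+{ord(ch):04X}" for ch in text]

print(raw.hex(" "))   # e6 b1 89 e5 ad 97
print(text)           # 汉字
print(code_points)    # ['U+6C49', 'U+5B57']
```

After decoding, the program holds two code points. How those code points are laid out in memory is a separate choice, which is what the questions below are about.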

So now:

  • What is the bit sequence in memory? Is it the same as above?
  • Does it depend on the platform?
  • Is the in-memory result always the same, whatever encoding the file was saved in?

Solution

  • As so often, the answer is "it depends".

    Generally speaking, in-memory text has to use some encoding, just like on-disk text does.

    But whether that encoding is the same as the on-disk one or not depends on the application.

    Some have a preferred encoding in which they represent text in memory (such as UTF-16, or even UCS-4 if they are feeling wasteful), while others hold the text in memory in the same encoding as on disk and just interpret it as necessary when rendering or searching.

    There's no universal rule that requires one approach or another. Some languages/platforms have a strong preference.

    For example, Java uses UTF-16 for in-memory String objects (although since Java 9, as an internal optimization, it may store a string as Latin-1 when every character in the text allows it).
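To make the difference concrete, here is a small Python sketch that prints the byte sequences the same two characters would have under a few candidate in-memory encodings. The big-endian variants are chosen only for readability; the actual byte order depends on the platform and the application. `fits_latin1` is a hypothetical helper that mimics the condition for a Latin-1 optimization and is not Java's actual implementation.

```python
text = "汉字"

# Candidate in-memory layouts for the same two characters.
# The "-be" codecs pin big-endian byte order and omit the byte order mark.
layouts = {
    "utf-8": text.encode("utf-8"),
    "utf-16-be": text.encode("utf-16-be"),
    "utf-32-be": text.encode("utf-32-be"),
}

for name, data in layouts.items():
    print(f"{name:10} {len(data)} bytes  {data.hex(' ')}")
# utf-8      6 bytes  e6 b1 89 e5 ad 97
# utf-16-be  4 bytes  6c 49 5b 57
# utf-32-be  8 bytes  00 00 6c 49 00 00 5b 57


def fits_latin1(s: str) -> bool:
    """Hypothetical check: True if every character is in U+0000..U+00FF."""
    return all(ord(ch) <= 0xFF for ch in s)


print(fits_latin1("café"))  # True
print(fits_latin1(text))    # False
```

Only the UTF-8 layout matches the bits in the file. An application that keeps UTF-16 or UTF-32 in memory ends up with a different bit sequence for the same text, and that sequence is the same regardless of which encoding the file used on disk.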