
What determines how strings are encoded in memory?


Say we have a file that is Latin-1 encoded, and we use a text editor to read that file into memory. My questions are then:

  • How will those character strings be represented in memory? Latin-1, UTF-8, UTF-16 or something else?
  • What determines how those strings are represented in memory? Is it the application, the programming language the application was written in, the OS or the hardware?

As a follow-up question:

  • How do applications then save files to encoding schemes that use different character sets? E.g. converting UTF-8 to UTF-16 seems fairly intuitive to me, as I assume you just decode to the Unicode code point, then encode to the target encoding. But what about going from UTF-8 to Shift-JIS, which has a different character set?

Solution

  • Operating system

  • Programming language

    Depends on its age or on its compiler: while a language itself is not necessarily bound to an OS, the compiler that produces the binaries may treat things differently per OS.
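As a sketch of how the language runtime itself decides the in-memory form (using Python here, since the question is language-agnostic): CPython 3.3+ stores each string with the narrowest fixed width that fits its widest character, regardless of the encoding the input came from.

```python
import sys

# CPython picks 1, 2, or 4 bytes per character depending on the widest
# code point in the string -- a language/runtime decision, independent
# of whatever encoding the source file or input used.
ascii_text  = "a" * 100            # fits in 1 byte per character
bmp_text    = "\u3042" * 100       # Hiragana A: needs 2 bytes per character
astral_text = "\U0001F600" * 100   # emoji: needs 4 bytes per character

# Same character count, increasingly large in-memory footprint:
assert sys.getsizeof(ascii_text) < sys.getsizeof(bmp_text) < sys.getsizeof(astral_text)
```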

  • Application/program

    Depends on the platform/OS. While the in-memory representation of text is strongly influenced by the programming language, its compiler and the data types used, linking in libraries (which may have been produced by entirely different compilers and programming languages) can mix this up.

    Strictly speaking, the binary file format also has its own fixed encodings: on Windows, the PE format (used in EXE, DLL, etc.) stores resource strings in 16-bit characters. So while e.g. the Free Pascal Compiler can (per the language) make heavy use of UTF-8, it will still build an EXE file with UTF-16 metadata in it.

    Programs that deal with text (such as editors) will most likely hold any encoding "as is" in memory for the sake of performance, with compromises such as temporarily duplicating parts into strings of 32 bits per character just to search through them quickly, let alone supporting Unicode normalization.
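To illustrate that the in-memory form need not match the on-disk encoding, a minimal Python sketch: once decoded, the same text yields the same code points regardless of which encoding the file used.

```python
# The same text stored in two different on-disk encodings...
latin1_bytes = "café".encode("latin-1")  # é is 1 byte: 0xE9
utf8_bytes   = "café".encode("utf-8")    # é is 2 bytes: 0xC3 0xA9
assert latin1_bytes != utf8_bytes        # the byte sequences differ

# ...decodes to identical in-memory code points:
assert latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8")
```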

  • Conversion

    The most common approach is to use a common denominator:

    • Every input is decoded into 32-bit characters (Unicode code points), which are then encoded into the target encoding. This costs the most memory, but is the easiest to handle.
    • In the WinAPI you either convert to UTF-16 via MultiByteToWideChar(), or from UTF-16 via WideCharToMultiByte(). To go from UTF-8 to Shift-JIS you'd make a sidestep from UTF-8 to UTF-16, then from UTF-16 to Shift-JIS. Support for the various encodings shifts per Windows version and localized installation; there is no guarantee that all of them are available.
    • External libraries specialized in encodings alone, such as iconv, can do this; they support many encodings independent of OS support.
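The common-denominator approach from the first bullet, sketched in Python (its built-in codecs play the role iconv would elsewhere): decode the source bytes to Unicode code points, then encode to the target. Characters missing from the target's repertoire fail at the encode step.

```python
# UTF-8 -> Shift-JIS via the common denominator (Unicode code points)
utf8_bytes = "こんにちは".encode("utf-8")
text = utf8_bytes.decode("utf-8")        # step 1: decode to code points
sjis_bytes = text.encode("shift_jis")    # step 2: encode to the target
assert sjis_bytes.decode("shift_jis") == text

# A character outside Shift-JIS's repertoire cannot be converted:
try:
    "€".encode("shift_jis")
except UnicodeEncodeError:
    pass  # the Euro sign has no Shift-JIS code point
```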