c++ encoding utf-8 utf-32

Encoding independent input stream in C++


I have a C++ program that reads text files. Currently I use C's fopen() to open the file and fgetc() to read the next character. I typedef'd a "file character" type, which is actually an int (and I can change it to long without problems).

Right now the program can read UTF-7 and UTF-8 text files, but what if I am given UTF-16 or UTF-32 text files? Is there a way to infer the file encoding and then read the file properly? Switching to C++ istreams would not be a problem either.


Solution

  • You cannot infer the encoding with certainty, but in practice you can try a list of candidate encodings and reject the ones that do not fit.

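    • UTF-16 text that is mostly ASCII or Latin-1 will likely have a '\0' byte very early (whether at even or odd positions is decided by the byte order, which for UTF-16 is either big-endian or little-endian);
    • UTF-32 text will likely have three consecutive '\0' bytes in each of those characters; while
    • UTF-8 text should not contain '\0' bytes at all, unless the file really stores the NUL character (U+0000).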

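    Additionally, Unicode text files are permitted (but not required) to start with a byte order mark: https://en.wikipedia.org/wiki/Byte_order_mark . If the file has one, you are lucky, because it differs among the encodings: EF BB BF for UTF-8, FE FF (big-endian) or FF FE (little-endian) for UTF-16, and 00 00 FE FF (big-endian) or FF FE 00 00 (little-endian) for UTF-32.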
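
The two checks described above (byte order mark first, zero-byte pattern as a fallback) could be sketched roughly as follows. This is only an illustration under some assumptions: the names Encoding, Detection, detectEncoding and detectFileEncoding are made up for this sketch, the fallback assumes the file starts with characters in the ASCII range, and an Unknown result means the caller still has to try candidate encodings one by one.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical names for this sketch; they are not part of any library.
enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

struct Detection {
    Encoding encoding;    // best guess
    std::size_t bomSize;  // bytes to skip before decoding (0 if no BOM)
};

// Guess the encoding from the first n bytes of a file.
Detection detectEncoding(const unsigned char* p, std::size_t n) {
    // 1. Byte order marks. UTF-32 is tested before UTF-16 because the
    //    UTF-32 LE mark (FF FE 00 00) begins with the UTF-16 LE mark.
    //    (A UTF-16 LE file starting with U+0000 would be misread here.)
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return {Encoding::Utf32BE, 4};
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return {Encoding::Utf32LE, 4};
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return {Encoding::Utf8, 3};
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return {Encoding::Utf16BE, 2};
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return {Encoding::Utf16LE, 2};

    // 2. No BOM: look at where the zero bytes fall in the first four bytes.
    //    Assumption: the text starts with ASCII-range characters.
    if (n >= 4) {
        const bool z0 = p[0] == 0, z1 = p[1] == 0, z2 = p[2] == 0, z3 = p[3] == 0;
        if (z0 && z1 && z2 && !z3) return {Encoding::Utf32BE, 0};
        if (!z0 && z1 && z2 && z3) return {Encoding::Utf32LE, 0};
        if (z0 && !z1 && z2 && !z3) return {Encoding::Utf16BE, 0};
        if (!z0 && z1 && !z2 && z3) return {Encoding::Utf16LE, 0};
    }

    // 3. A zero byte that matches none of the patterns above: give up.
    for (std::size_t i = 0; i < n; ++i)
        if (p[i] == 0) return {Encoding::Unknown, 0};

    // No zero bytes in the sample: UTF-8 (or plain ASCII) is the best guess.
    return {n > 0 ? Encoding::Utf8 : Encoding::Unknown, 0};
}

// Convenience wrapper using fopen(), as in the question.
Detection detectFileEncoding(const char* path) {
    unsigned char buf[4];
    std::FILE* f = std::fopen(path, "rb");  // binary mode: no newline translation
    if (!f) return {Encoding::Unknown, 0};
    const std::size_t n = std::fread(buf, 1, sizeof buf, f);
    std::fclose(f);
    return detectEncoding(buf, n);
}
```

After calling detectFileEncoding(), the reader would reopen (or seek) the file, skip bomSize bytes, and then assemble code units of 1, 2 or 4 bytes in the detected byte order. A guess of Utf8 without a BOM is still only a guess; validating the byte sequences while decoding is the real test.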