Tags: c++, unicode, character-encoding, utf, icu

Detect charset of file dynamically in c++


I am trying to read a file which may have any charset/code page, but I don't know which locale to set in order to read the file correctly.

Below is my code snippet in which I read a file whose charset is windows-1256, but I want to detect the charset dynamically from the file being read so that I can set the locale accordingly.

std::wifstream input{ filename.c_str() };
input.imbue(std::locale(".1256")); // imbue before reading, or the conversion is not applied
std::wstring content{ std::istreambuf_iterator<wchar_t>(input), std::istreambuf_iterator<wchar_t>() };
contents = ws2s(content); // convert wstring to CString

Solution

  • In general, this is impossible to do accurately from the content of a plain text file alone. Usually you should rely on some external information. For example, if the file was downloaded over HTTP, the encoding should be given in the Content-Type response header.

    Some files may contain information about the encoding as specified by the file format. XML for example: <?xml version="1.0" encoding="XXX"?>.

    Unicode encodings can be detected if the file starts with a Byte Order Mark (BOM) - but the BOM is optional, so its absence proves nothing.
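    As a sketch, BOM detection just inspects the first few bytes (the function name detect_bom is my own; note the check order matters, because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch: identify a Unicode encoding from a leading byte order mark.
// Returns "" when no BOM is present - since a BOM is optional, "" does
// not rule out a Unicode encoding.
std::string detect_bom(const unsigned char* b, std::size_t n) {
    // Check UTF-32 first: its little-endian BOM starts with the UTF-16LE BOM.
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32BE";
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32LE";
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE";
    return "";  // no BOM: plain ASCII, UTF-8 without BOM, or a legacy code page
}
```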

    You can usually assume the encoding uses wide (2-byte) code units if the file contains a zero byte before the end - a narrow-character encoding would treat that byte as a string terminator. Likewise, if you find two consecutive zero bytes aligned on a 2-byte boundary (before the end), the encoding is probably 4 bytes wide.
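    That heuristic can be sketched as follows (guess_unit_width is a hypothetical helper of my own; it trims trailing zero bytes first so a terminator at the very end of the file does not count):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rough heuristic: a lone zero byte before EOF hints at a 2-byte
// (UTF-16-like) encoding; two zero bytes on an even 2-byte boundary hint
// at a 4-byte (UTF-32-like) encoding. Purely a guess, not definitive.
int guess_unit_width(const std::vector<unsigned char>& data) {
    // Ignore trailing zeroes so a terminator at EOF does not count.
    std::size_t end = data.size();
    while (end > 0 && data[end - 1] == 0) --end;

    for (std::size_t i = 0; i + 1 < end; i += 2)  // scan 2-byte aligned pairs
        if (data[i] == 0 && data[i + 1] == 0)
            return 4;                             // aligned double zero: 4-byte units
    for (std::size_t i = 0; i < end; ++i)
        if (data[i] == 0)
            return 2;                             // any interior zero: 2-byte units
    return 1;                                     // no zero bytes: narrow encoding
}
```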

    Other than that, you could try to guess the encoding from the frequency and patterns of certain byte values, but this is only a guess and can misidentify the file.
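    ICU (one of the question's tags) ships a statistical charset detector (the CharsetDetector class / the C ucsdet_* API) built on exactly this idea. A much simpler hand-rolled pattern check is to test whether the bytes form valid UTF-8: text in a legacy code page such as windows-1256 almost never validates by accident. The sketch below is my own and deliberately loose - it ignores overlong forms and surrogate ranges:

```cpp
#include <cassert>
#include <cstddef>

// Sketch: if every byte sequence follows the UTF-8 lead/continuation
// rules, the file is very likely UTF-8. Loose check: overlong encodings
// and surrogate code points are not rejected.
bool looks_like_utf8(const unsigned char* p, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        unsigned char c = p[i];
        std::size_t extra;
        if (c < 0x80)                extra = 0;  // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;  // 2-byte lead
        else if ((c & 0xF0) == 0xE0) extra = 2;  // 3-byte lead
        else if ((c & 0xF8) == 0xF0) extra = 3;  // 4-byte lead
        else return false;                       // stray continuation or invalid lead
        if (i + extra >= n) return false;        // truncated sequence at EOF
        for (std::size_t k = 1; k <= extra; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return false;  // not a continuation byte
        i += extra + 1;
    }
    return true;
}
```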