Tags: c++, c++17, utf, std-filesystem

UTF8 to UTF16 conversion using std::filesystem::path


Since C++11, one can convert UTF-8 to UTF-16 stored in wchar_t (at least on Windows, where wchar_t is 16 bits wide) using std::codecvt_utf8_utf16:

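#include <codecvt>  // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>   // std::wstring_convert (deprecated since C++17)
#include <string>

// Throws std::range_error if the input is not valid UTF-8.
std::wstring utf8ToWide( const char* utf8 )
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    return converter.from_bytes( utf8 );
}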

Unfortunately, std::codecvt_utf8_utf16 is deprecated as of C++17. But there is std::filesystem::path, which has all the possible conversions built in; e.g. it has the members

std::string string() const;
std::wstring wstring() const;
std::u8string u8string() const;   // returns std::string before C++20
std::u16string u16string() const;
std::u32string u32string() const;

So the above function can be rewritten as follows:

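#include <filesystem>
#include <string>

std::wstring utf8ToWide( const char* utf8 )
{
    // char8_t requires C++20; in C++17, std::filesystem::u8path( utf8 ) plays the same role.
    return std::filesystem::path( reinterpret_cast<const char8_t*>( utf8 ) ).wstring();
}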

And unlike std::codecvt_utf8_utf16, this does not use any deprecated part of C++.

What kind of drawbacks can be expected from such a converter? For example, is there a maximum length a path can have, or are certain Unicode symbols prohibited in it?


Solution

  • What kind of drawbacks can be expected from such a converter?

    Well, let's get the most obvious drawback out of the way. For a user who doesn't know what you're doing, it makes no sense. Doing UTF-8-to-16 conversion by using a path type is bonkers, and should be seen immediately as a code smell. It's the kind of awful hack you do when you are needlessly averse to just downloading a simple library that would do it correctly.

    Also, it doesn't have to work. path is meant for storing... paths. Hence the name. Specifically, it's meant for storing paths in a way easily consumed by the filesystem in question. As such, the string stored in a path can be subject to any limitations the filesystem wants to put on it, outside of the handful of things the C++ standard requires of it.

    For example, if the filesystem is case-insensitive (or even just ASCII-case-insensitive), it is a legitimate implementation to have it just case-convert all strings to lowercase when they are stored in a path. Or to case-convert them when you extract them from a path. Or anything of the like.

    path can convert all of your \s into /s. Or your :s into /s. Or play any other implementation-dependent tricks it wants to.
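    If you want to see what your own implementation does, a round-trip check is a quick way to find out. The following is only a sketch: survivesPathRoundTrip is a hypothetical helper name, not anything from the standard library, and a passing result on one platform says nothing about another.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: reports whether `text` comes back unchanged after
// being stored in a std::filesystem::path and read out again as UTF-16.
// What happens inside path is implementation-defined, so the result is only
// meaningful for the platform and standard library this is compiled with.
bool survivesPathRoundTrip(const std::u16string& text)
{
    return std::filesystem::path(text).u16string() == text;
}
```

    For example, survivesPathRoundTrip(u"a\\b:c") asks whether backslashes and colons pass through untouched on the current platform.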

    If you're afraid of using a deprecated facility, just download a simple UTF-8/UTF-16 conversion library. Or write one yourself; it isn't that difficult.
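    To give an idea of how little is involved in writing one yourself, here is a minimal sketch of a hand-written converter. The name utf8ToUtf16 and the choice to throw std::invalid_argument on malformed input are assumptions made for this illustration; it returns std::u16string rather than std::wstring so that it behaves the same on every platform, and it has not been tuned for speed.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Decodes UTF-8 and re-encodes it as UTF-16. Throws std::invalid_argument on
// malformed input: bad lead or continuation bytes, truncated sequences,
// overlong forms, encoded surrogates, and code points above U+10FFFF.
std::u16string utf8ToUtf16(std::string_view in)
{
    std::u16string out;
    out.reserve(in.size());

    std::size_t i = 0;
    while (i < in.size()) {
        const auto lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;      // code point being decoded
        std::size_t extra = 0; // number of continuation bytes expected
        char32_t lowest = 0;   // smallest code point valid for this length

        if (lead < 0x80)                { cp = lead;        extra = 0; lowest = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; lowest = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; lowest = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; lowest = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i - 1 < extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += 1 + extra;

        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points beyond the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

    On Windows, where wchar_t is 16 bits wide, the result can be copied into a std::wstring with std::wstring( result.begin(), result.end() ).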