I read around 20 questions and checked the documentation with no success. I have no experience writing code that handles text encodings; I have always avoided it.
Let's say I have a file which I am sure will always be UTF-8:
á
Let's say I have code:
wifstream input{argv[1]};
wstring line;
getline(input, line);
When I debug it, I see it's stored as L"Ã¡", so it's not iterable the way I want. I want just 1 symbol so that I can call, say, iswalnum(line[0]).
I found that there is a codecvt facet, but I am not sure how to use it or whether it is the best way. I use cl.exe from VS2019, which gives me a lot of conversion and deprecation errors on the example provided at https://en.cppreference.com/w/cpp/locale/codecvt_utf8
I also found the from_bytes function, but cl.exe gives me a lot of errors on that example, too: https://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
So how do I correctly read a line containing, say, the letter (symbol) á and iterate over it as a container of size 1, so that a function like iswalnum can simply be called?
EDIT: When I fix the bugs in those examples (for c++latest), I still have á in UTF-8 and á in UTF-16.
L"Ã¡" means the file was read with the wrong encoding: each of the two UTF-8 bytes of á was turned into a separate character. You have to imbue a UTF-8 locale before reading from the stream.
wifstream input{argv[1]};
input.imbue(std::locale("en_US.UTF-8"));
wstring line;
getline(input, line);
Now wstring line will contain Unicode code points (á in your case) and can be easily iterated.
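For reference, the snippet above can be wrapped into a small helper. This is only a sketch: the function name read_first_line_utf8 is hypothetical, and it assumes a locale named "en_US.UTF-8" is installed on the system. If it is not, the std::locale constructor throws std::runtime_error, and a different UTF-8 locale name would have to be passed.

```cpp
#include <cassert>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: reads the first line of a UTF-8 encoded file
// into a wide string.
// Assumption: the locale named by locale_name exists on this system;
// otherwise std::locale's constructor throws std::runtime_error.
std::wstring read_first_line_utf8(const std::string& path,
                                  const char* locale_name = "en_US.UTF-8") {
    std::wifstream input;
    input.imbue(std::locale(locale_name));  // set the locale before any I/O
    input.open(path);
    std::wstring line;
    std::getline(input, line);
    return line;
}
```

With a file containing only á, the returned string should have size 1. Note that iswalnum consults the global C locale set through std::setlocale, not the locale imbued on the stream, so classifying non-ASCII letters may also require a std::setlocale call or the two-argument std::isalnum(ch, loc) from <locale>.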
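Caveat: on Windows wchar_t is only 16 bits wide, so a wstring holds UTF-16 code units. Indexing it yields whole code points only for characters in the Basic Multilingual Plane (BMP); anything outside the BMP occupies two elements (a surrogate pair).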
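If characters outside the BMP matter, one option is to convert the wide string to UTF-32 before iterating. The following is a minimal sketch, not a validated decoder: the function name to_code_points is hypothetical, and unpaired surrogates are simply copied through unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: converts a wide string to UTF-32 code points.
// A high surrogate followed by a low surrogate is combined into one
// code point; every other element is copied as-is.
std::u32string to_code_points(const std::wstring& ws) {
    std::u32string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        char32_t c = static_cast<char32_t>(ws[i]);
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < ws.size()) {
            char32_t low = static_cast<char32_t>(ws[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                // Standard UTF-16 surrogate pair decoding.
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        out.push_back(c);
    }
    return out;
}
```

Each element of the result is then one code point, regardless of whether wchar_t is 16 or 32 bits wide on the platform.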