c++, utf-8, isalpha

How to read UTF-8 text from a file into an iterable container and check every character for being alphanumeric in C++?


I have read around 20 questions and checked the documentation, with no success. I have no experience writing code that handles this; I have always avoided it.

Let's say I have a file that I am sure will always be UTF-8:

á

Let's say I have code:

  wifstream input{argv[1]};
  wstring line;
  getline(input, line);

When I debug it, I see it is stored as two wide characters, L"Ã¡" (the two UTF-8 bytes 0xC3 0xA1, each widened separately), so it is not iterable the way I want. I want a single character, so that I can call, say, iswalnum(line[0]).

I found that there are codecvt facets, but I am not sure how to use them or whether they are the best approach. I use cl.exe from VS2019, which reports many conversion and deprecation errors on the example provided here: https://en.cppreference.com/w/cpp/locale/codecvt_utf8

I also found the from_bytes function, but cl.exe reports many errors on that example too: https://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes

So how do I correctly read a line containing, say, the letter á and iterate over it as a container of size 1, so that a function like iswalnum can simply be called on each element?

EDIT: When I fix the bugs in those examples (for c++latest), I still have á in UTF-8 and á in UTF-16.


Solution

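  • Seeing two wide characters (L"Ã¡") instead of a single á means the file was read with the wrong encoding: each byte of the UTF-8 sequence was widened on its own. You have to imbue the stream with a UTF-8 locale before reading from it.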

      wifstream input{argv[1]};
      // Imbue before the first read; the locale name is platform-dependent.
      input.imbue(std::locale("en_US.UTF-8"));
      wstring line;
      getline(input, line);
    

    Now wstring line will contain Unicode code points (á in your case) and can be easily iterated.
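Once the line has been decoded, checking every character is a plain loop. The following is a minimal sketch, not part of the original answer: count_alnum is a hypothetical helper name, and it uses the std::isalnum overload from <locale> so that the classification rules come from whichever locale you pass in. Whether a letter such as á counts as alphanumeric depends on that locale.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper (assumption, not from the answer): count the
// alphanumeric elements of a wide string, classifying each wchar_t
// with the ctype facet of the supplied locale.
std::size_t count_alnum(const std::wstring& line, const std::locale& loc) {
    std::size_t n = 0;
    for (wchar_t ch : line) {
        if (std::isalnum(ch, loc)) {
            ++n;
        }
    }
    return n;
}
```

With the stream from the answer, a call could look like count_alnum(line, input.getloc()), which reuses the locale that was imbued into the stream.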


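    Caveat: on Windows wchar_t is only 16 bits wide and holds UTF-16 code units, so iterating element by element is correct only for characters in the Basic Multilingual Plane (BMP). A character outside the BMP takes two elements (a surrogate pair).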
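If characters outside the BMP can occur, one option is to combine surrogate pairs into full code points before iterating. This is only a sketch under assumptions: to_code_points is a hypothetical helper, not a standard function, and it passes unpaired surrogates through unchanged instead of reporting an error.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (assumption): convert a wide string to UTF-32.
// A high surrogate followed by a low surrogate is merged into one code
// point; every other element is copied as-is. Where wchar_t is already
// 32 bits wide, valid input contains no surrogates and is simply copied.
std::u32string to_code_points(const std::wstring& w) {
    std::u32string out;
    for (std::size_t i = 0; i < w.size(); ++i) {
        char32_t c = static_cast<char32_t>(w[i]);
        const bool is_high = c >= 0xD800 && c <= 0xDBFF;
        if (is_high && i + 1 < w.size()) {
            const char32_t next = static_cast<char32_t>(w[i + 1]);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Standard UTF-16 decoding of a surrogate pair.
                c = 0x10000 + ((c - 0xD800) << 10) + (next - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        out.push_back(c);
    }
    return out;
}
```

The result has exactly one element per code point, so size() and indexing behave the same on every platform. Note that the wide classification functions such as iswalnum still take a wint_t, so on Windows they cannot classify code points above U+FFFF.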