Why does `std::mbrlen` on mingw-w64 always return one (`1`)?


When I compile the following source code with mingw-w64, std::mbrlen always reports a length of 1 (one) byte:

#include <cstddef>
#include <cstdio>
#include <clocale>
#include <cstring>
#include <cwchar>

void print_mb(const char* ptr)
{
  std::size_t index{0};
  const char* end = ptr + std::strlen(ptr);
  int len;
  while((len = std::mbrlen(ptr, end-ptr, nullptr)) > 0)
  {
    std::printf("Character #%zu is %i bytes long.\n", index++, len);
    ptr += len;
  }
}

int main()
{
  std::setlocale(LC_ALL, "en_US.utf8");
  const char* str = "\x7a\xc3\x9f\xe6\xb0\xb4\xf0\x9d\x84\x8b";
  print_mb(str);
}

The sample code is based on the example from the std::mbrtowc page on cppreference.

After compiling this sample under mingw-w64 with

gcc sample.cxx

I get the following output from the program:

Character #0 is 1 bytes long.
Character #1 is 1 bytes long.
Character #2 is 1 bytes long.
Character #3 is 1 bytes long.
Character #4 is 1 bytes long.
Character #5 is 1 bytes long.
Character #6 is 1 bytes long.
Character #7 is 1 bytes long.
Character #8 is 1 bytes long.
Character #9 is 1 bytes long.

But if I compile the same code with the online compiler on the cppreference page, with GCC under Arch Linux (again with a plain gcc sample.cxx), with Microsoft Visual C++ 2017 (cl sample.cxx), or with the Intel C++ compiler 2018 (icl sample.cxx), I get this:

Character #0 is 1 bytes long.
Character #1 is 2 bytes long.
Character #2 is 3 bytes long.
Character #3 is 4 bytes long.

What may cause this behavior of std::mbrlen under mingw-w64? Thanks.


My host is Microsoft Windows 10 x86-64. The mingw-w64, Microsoft Visual C++, and Intel C++ builds were all done on this host.


Solution

  • Windows does not support UTF-8 via the C and C++ locales.

    https://msdn.microsoft.com/en-us/library/x99tb11d.aspx

    The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8.

    Additionally, the locale names on Windows are different from those on Linux, e.g. setlocale( LC_ALL, "English_United States.1252" );

    The C and C++ locale system is implementation-defined, and the only usable implementation is the one on Linux (glibc).

    On Windows, if you want UTF-8 or other Unicode support, you need to resort to the Windows API or to other libraries.
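
As a rough illustration of the "do it without the locale" route, here is a minimal sketch that assumes the input is UTF-8 and derives each sequence length from its lead byte. The names locale_available, utf8_char_length and print_utf8 are hypothetical helpers written for this answer, not part of any library. The validation is deliberately shallow: overlong encodings and surrogate code points are not rejected.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: reports whether the C runtime accepts a locale name.
// std::setlocale returns a null pointer when the request cannot be honored.
// The previously active locale is restored before returning.
bool locale_available(const char* name)
{
  const char* current = std::setlocale(LC_ALL, nullptr);
  const std::string saved = current ? current : "C";
  const bool ok = std::setlocale(LC_ALL, name) != nullptr;
  std::setlocale(LC_ALL, saved.c_str());
  return ok;
}

// Hypothetical helper: length in bytes (1-4) of the UTF-8 sequence that
// starts at p, or 0 if [p, end) is empty, the lead byte is invalid, the
// sequence is truncated, or a continuation byte is malformed.
std::size_t utf8_char_length(const char* p, const char* end)
{
  if (p >= end) return 0;
  const unsigned char lead = static_cast<unsigned char>(*p);
  std::size_t len;
  if (lead < 0x80)                len = 1;  // 0xxxxxxx: ASCII
  else if ((lead & 0xE0) == 0xC0) len = 2;  // 110xxxxx
  else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx
  else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx
  else return 0;                            // stray continuation or invalid lead byte
  if (static_cast<std::size_t>(end - p) < len) return 0;  // truncated sequence
  for (std::size_t i = 1; i < len; ++i)
    if ((static_cast<unsigned char>(p[i]) & 0xC0) != 0x80) return 0;  // not 10xxxxxx
  return len;
}

// Locale-independent counterpart of print_mb from the question.
// Returns the number of characters that were printed.
std::size_t print_utf8(const char* ptr)
{
  std::size_t index{0};
  const char* end = ptr + std::strlen(ptr);
  std::size_t len;
  while ((len = utf8_char_length(ptr, end)) > 0)
  {
    // Cast to unsigned so the format works even where %zu is not supported.
    std::printf("Character #%u is %u bytes long.\n",
                static_cast<unsigned>(index++), static_cast<unsigned>(len));
    ptr += len;
  }
  return index;
}
```

Calling print_utf8 on the string from the question reports 1, 2, 3 and 4 bytes for the four characters, whatever locale is active. Calling locale_available("en_US.utf8") before relying on std::mbrlen shows whether the runtime accepted the name at all. If it did not, the program keeps running in the default "C" locale, which would be consistent with the one-byte-per-character output in the question.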