Search code examples
c++regexunicodeencodingwidechar

Does `std::wregex` support utf-16/unicode or only UCS-2?


With the regex library was introduced into the standard library.

On the Windows/MSVC platform wchar_t has size of 2 (16 bit) and wchar_t* is normally when interfacing with the system/platform (eg. CreateFileW).

However it seems that std::regex isn't or does not support it, so I'm wondering whether std::wregex supports or just ?

I do not find any mention of this (Unicode or the like) in the documentation. In other languages normalization takes place.

The question is:

Is std::wregex representing when wchar_t has size of 2 ?


Solution

  • C++ standard doesn't enforce any encoding on std::string and std::wstring. They're simply a series of CharT. Only std::u8string, std::u16string and std::u32string have defined encoding

    Similarly std::regex and std::wregex also wrap around std::basic_string and CharT. Their constructors accept std::basic_string and the encoding being used for std::basic_string will also be used for std::basic_regex. So what you said "std::regex isn't utf-8 or does not support it" is wrong. If the current locale is UTF-8 then std::regex and std::string will be UTF-8 (yes, modern Windows does support UTF-8 locale)

    On Windows std::wstring uses UTF-16 so std::wregex also uses UTF-16. UCS-2 is deprecated and no one uses it anymore. You don't even need to differentiate between then since UCS-2 is just a subset of UTF-16 unless you use some very old tool that cuts in the middle of a surrogate pair. String searches in UTF-16 works exactly the same as in UCS-2 because UTF-16 is self-synchronized and a proper needle string can never match from the middle of a haystack. Same to UTF-8. If the tool doesn't understand UTF-16 then it's highly likely that it doesn't know that UTF-8 is variable length either, and will truncate the UTF-8 in the middle

    Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.

    https://en.wikipedia.org/wiki/UTF-8#Description

    The only things you need to care about are: avoid truncating in the middle of a character, and normalize the string before matching if necessary. The former issue can be avoided in UCS-2-only regex engines if you never use characters outside the BMP in a character class like commented. Replace them with a group instead

    In other languages normalization takes place.

    This is wrong. Some languages may do normalization before matching a regex, but that definitely doesn't apply to all "other languages"

    If you want a little bit more assurance then use std::basic_regex<char8_t> and std::basic_regex<char16_t> for UTF-8 and UTF-16 respectively. You'll still need a UTF-16-aware library though, otherwise that'll still only work for regex strings that only contain words

    The better solution may be changing to another library like ICU regex. You can check Comparison of regular expression engines for some suggestions. It even has a column indicating native UTF-16 support for each library

    Related:

    See also