With C++11, the regex library was introduced into the standard library.
On the Windows/MSVC platform, wchar_t has a size of 2 bytes (16 bits), and wchar_t* is normally UTF-16 when interfacing with the system/platform (e.g. CreateFileW).
However, it seems that std::regex isn't UTF-8 or does not support it, so I'm wondering whether std::wregex supports UTF-16 or just UCS-2.
I cannot find any mention of this (Unicode or the like) in the documentation. In other languages normalization takes place.
The question is:
Does std::wregex represent UCS-2 when wchar_t has a size of 2?
The C++ standard doesn't enforce any encoding on std::string and std::wstring. They're simply sequences of CharT. Only std::u8string, std::u16string and std::u32string have a defined encoding.
Similarly, std::regex and std::wregex are just std::basic_regex instantiated for a CharT (char and wchar_t respectively). Their constructors accept a std::basic_string, and whatever encoding is used for that std::basic_string is also what the std::basic_regex works on. The engine itself compares CharT code units and does not decode them. So what you said, "std::regex isn't utf-8 or does not support it", is not quite right. If the current locale is UTF-8 then std::regex and std::string will hold UTF-8 (yes, modern Windows does support a UTF-8 locale).
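As a minimal sketch of what "the regex uses the string's encoding" means in practice, assume UTF-8 text stored in a std::string. The names full_match and e_acute are made up for this illustration. The pattern and the input are both just sequences of char code units, so a literal multi-byte character matches itself, while "." consumes a single code unit.

```cpp
#include <regex>
#include <string>

// "é" (U+00E9) encoded as UTF-8: two char code units.
// Written with escapes so the source file's encoding does not matter.
const std::string e_acute = "\xC3\xA9";

// Hypothetical helper: true if the whole input matches the pattern.
// std::regex works on the raw char code units; nothing is decoded.
bool full_match(const std::string& input, const std::string& pattern) {
    return std::regex_match(input, std::regex(pattern));
}
```

With this helper, full_match(e_acute, e_acute) should succeed because the two code units are matched literally, whereas a single "." is not enough to cover the character and ".." is.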
On Windows, std::wstring uses UTF-16, so std::wregex also uses UTF-16. UCS-2 is deprecated and no one uses it anymore. You don't even need to differentiate between them, since UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts in the middle of a surrogate pair. String searches in UTF-16 work exactly the same as in UCS-2 because UTF-16 is self-synchronizing, so a proper needle string can never match starting in the middle of a character in the haystack. The same applies to UTF-8. If a tool doesn't understand UTF-16, then it's highly likely that it doesn't know that UTF-8 is variable-length either, and it will truncate UTF-8 text in the middle of a character.
Self-synchronization (in UTF-8): The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.
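The "back up at most 3 bytes" property can be sketched as follows, assuming the string is well-formed UTF-8. The names is_continuation and char_start are illustrative, not from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Hypothetical helper: index of the lead byte of the character covering `pos`.
// Assumes `s` is well-formed UTF-8, so the loop steps back at most 3 times.
std::size_t char_start(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}
```

A tool that truncates or splits text could call something like char_start on the cut position first, so that it never cuts inside a multi-byte sequence.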
The only things you need to care about are: avoid truncating in the middle of a character, and normalize the string before matching if necessary. The former issue can be avoided in UCS-2-only regex engines if you never use characters outside the BMP in a character class, as noted in the comments. Replace them with a group (an alternation) instead.
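The group-instead-of-class advice might look like this. It is only a sketch: it uses UTF-8 with std::regex rather than UTF-16 with std::wregex, because wchar_t is not 16 bits on every platform, but the idea carries over to any engine that matches code units. The names kGrin and only_a_or_grin are made up. A class such as [a😀] would be read as a set of individual code units, whereas the alternation keeps the multi-unit sequence intact.

```cpp
#include <regex>
#include <string>

// U+1F600 encoded as UTF-8 (four code units), written with escapes so the
// example does not depend on the source file's encoding.
const char* const kGrin = "\xF0\x9F\x98\x80";

// Hypothetical helper: true if input consists only of 'a' and U+1F600.
// The non-BMP character is an alternative in a group, not a class member,
// so its code units can only match as one complete sequence.
bool only_a_or_grin(const std::string& input) {
    static const std::regex pattern(std::string("(?:a|") + kGrin + ")+");
    return std::regex_match(input, pattern);
}
```

With the group form, a lone continuation byte or a truncated sequence is rejected, which is exactly what a code-unit character class cannot guarantee.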
"In other languages normalization takes place."
This is wrong. Some languages may normalize before matching a regex, but that definitely doesn't apply to all "other languages".
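If you want a little more assurance, you could use std::basic_regex<char8_t> and std::basic_regex<char16_t> for UTF-8 and UTF-16 respectively. Note that the standard only provides std::regex_traits for char and wchar_t, so these instantiations need a custom traits class. You'll still need a UTF-16-aware library though; otherwise that will still only work for regex strings that contain nothing but plain words.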
The better solution may be to switch to another library, such as ICU regex. You can check the "Comparison of regular expression engines" article for some suggestions. It even has a column indicating native UTF-16 support for each library.