I'm trying to use the standard <regex> library to match some Cyrillic words:
// This is a UTF-8 file.
std::locale::global(std::locale("en_US.UTF-8"));
string s {"Каждый охотник желает знать где сидит фазан."};
regex re {"[А-Яа-яЁё]+"};
for (sregex_iterator it {s.begin(), s.end(), re}, end {}; it != end; it++) {
cout << it->str() << "#";
}
However, that doesn't seem to work. The code above prints the following:
Кажд�#й#о�#о�#ник#желае�#зна�#�#где#�#иди�#�#азан#
rather than the expected:
Каждый#охотник#желает#знать#где#сидит#фазан
The code of the '�' symbol above is \321.
I've checked the regular expression with grep and it works as expected. My locale is en_US.UTF-8. Both GCC and Clang produce the same result.
Is there anything I'm missing? Is there a way to "tame" <regex> so it would work with Cyrillic characters?
For ranges like А-Я to work properly, you must use std::regex::collate:
Constants
...
collate Character ranges of the form "[a-b]" will be locale sensitive.
Changing the regular expression to
std::regex re{"[А-Яа-яЁё]+", std::regex::collate};
gives the expected result.
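As background on the \321 byte from the question: in UTF-8 every letter in these ranges occupies two bytes, and std::regex over char works on individual char values, not on code points. A minimal sketch that makes the bytes visible (utf8_bytes is a hypothetical helper name, and the source file is assumed to be saved and compiled as UTF-8):

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the raw bytes of a std::string as
// unsigned values, one entry per char.
std::vector<unsigned> utf8_bytes(const std::string& s) {
    std::vector<unsigned> bytes;
    for (unsigned char c : s) {
        bytes.push_back(c);
    }
    return bytes;
}
```

For example, utf8_bytes("ы") yields 0xD1 0x8B, and 0xD1 is \321 in octal, which matches the stray byte in the output above.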
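Depending on the encoding of your source file, you might need to prefix the regular expression string with u8 (this applies before C++20; since C++20 a u8 literal has type const char8_t[] and no longer converts to the const char* that std::regex expects):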
std::regex re{u8"[А-Яа-яЁё]+", std::regex::collate};
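A different route, not the one described above, is to sidestep multi-byte sequences by matching on wide characters with std::wregex. The following is only a sketch under assumptions: cyrillic_words is a hypothetical helper name, the source file is saved and compiled as UTF-8, and the input has already been converted to std::wstring. Each Cyrillic letter then fits in a single wchar_t, so the ranges compare whole code points without the collate flag.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every run of Cyrillic letters in `text`.
// With wchar_t, each letter in the ranges below is one code unit, so the
// bracket expression is compared per character rather than per byte.
std::vector<std::wstring> cyrillic_words(const std::wstring& text) {
    static const std::wregex re(L"[А-Яа-яЁё]+");
    std::vector<std::wstring> words;
    for (std::wsregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

Calling cyrillic_words(L"Каждый охотник желает знать где сидит фазан.") should return the seven words without the trailing period. Converting from a UTF-8 std::string and printing through std::wcout are left out here, and both usually depend on the locale being set up correctly.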