<regex> having trouble with Cyrillic characters


I'm trying to use the standard <regex> library to match some Cyrillic words:

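  // This is a UTF-8 file.
  #include <iostream>
  #include <locale>
  #include <regex>
  #include <string>
  using namespace std;

  int main() {
    locale::global(locale("en_US.UTF-8"));

    string s {"Каждый охотник желает знать где сидит фазан."};
    regex re {"[А-Яа-яЁё]+"};

    // Print every match followed by '#'.
    for (sregex_iterator it {s.begin(), s.end(), re}, end {}; it != end; ++it) {
      cout << it->str() << "#";
    }
  }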

However, that doesn't seem to work. The code above results in the following:

  Кажд�#й#о�#о�#ник#желае�#зна�#�#где#�#иди�#�#азан#

rather than the expected:

  Каждый#охотник#желает#знать#где#сидит#фазан#

The code of the '�' symbol above is \321.

I've checked the regular expression I used with grep and it works as expected. My locale is en_US.UTF-8. Both GCC and Clang produce the same result.

Is there anything I'm missing? Is there a way to "tame" <regex> so it would work with Cyrillic characters?


Solution

  • For ranges like А-Я to work properly, you must pass the std::regex::collate flag. The documentation lists it among the syntax option constants:

    Constants
    ...
    collate    Character ranges of the form "[a-b]" will be locale sensitive.

    Changing the regular expression to

    std::regex re{"[А-Яа-яЁё]+", std::regex::collate};
    

    gives the expected result.


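    Depending on the encoding of your source file, you might need to prefix the regular expression string with u8. This applies before C++20 only: since C++20 a u8 literal has type const char8_t[], which no longer converts to the const char* that std::regex expects.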

    std::regex re{u8"[А-Яа-яЁё]+", std::regex::collate};
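
An alternative that does not depend on the locale's collation rules is to match on wide characters. std::regex over std::string compares one char at a time, and in UTF-8 every Cyrillic letter occupies two chars (the \321 in the question is octal for 0xD1, the lead byte of letters such as ы). A std::wregex sees each of these letters as a single unit. The following is a minimal sketch under that assumption, not part of the original answer. cyrillic_words is a hypothetical helper name, and the letters are written as \u escapes so that the encoding of the source file does not matter.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (an illustration, not from the original answer):
// collects every run of Cyrillic letters found in a wide string.
std::vector<std::wstring> cyrillic_words(const std::wstring& text) {
    // U+0410..U+042F is the range A-Ya (upper case), U+0430..U+044F the
    // lower-case range, U+0401 and U+0451 are the letter Yo in both cases.
    static const std::wregex re{L"[\u0410-\u042F\u0430-\u044F\u0401\u0451]+"};

    std::vector<std::wstring> words;
    for (std::wsregex_iterator it{text.begin(), text.end(), re}, end{};
         it != end; ++it) {
        words.push_back(it->str());  // one complete match per iteration
    }
    return words;
}
```

Calling cyrillic_words on a wide string returns the matched words in order. Converting UTF-8 input to std::wstring beforehand is outside the scope of this sketch.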