c++ regex unicode re2

Google RE2 doesn't recognize Unicode escape in regex


I am developing a C++ application that validates configuration files with regular expressions, using the Google RE2 library. The contents of each configuration file are read into a std::string.

So far, I have declared this string containing the regular expression:

const string EXPR_FAILED_FILE(R"([^\u0020-\u007E\n]|(\b.*(Mensagem|Antes|Loop|Movimentar|\|).*)|\\[0-9]{3,4})");

However, with the implementation below, I am having trouble detecting some invalid characters in my test string (strInput):

bool checkStringConsistency(const string& strInput){
    RE2 re(EXPR_FAILED_FILE);
    bool b_matches = RE2::FullMatch(strInput, re);
    return b_matches;
}

When I run the code, I get these messages on stderr:

re2/re2.cc:205: Error parsing '[^\u0020-\u007E\n]|(\b.*(Mensagem|Antes|Loop|Movimentar|\|).*)|\\[0-9]{3,4}': invalid escape sequence: \u
re2/re2.cc:890: Invalid RE2: invalid escape sequence: \u

It seems that RE2 does not recognize the \u sequence as a way to specify a Unicode range of characters. I tested this expression at regexr.com, and the invalid characters were detected normally there.

What could be wrong here?


Solution

  • Each regex engine has its own syntax. RE2 does not accept \uXXXX escapes, so you need to use [^\x{0020}-\x{007E}\n] instead of [^\u0020-\u007E\n]. See the syntax document:

    Escape sequences:
    \a  bell (== \007)
    \f  form feed (== \014)
    \t  horizontal tab (== \011)
    \n  newline (== \012)
    \r  carriage return (== \015)
    \v  vertical tab character (== \013)
    \*  literal «*», for any punctuation character «*»
    \123    octal character code (up to three digits)
    \x7F    hex character code (exactly two digits)
    \x{10FFFF}  hex character code
    \C  match a single byte even in UTF-8 mode
    \Q...\E literal text «...» even if «...» has punctuation
    

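    In that document, \u is listed as matching an uppercase character and is marked as NOT SUPPORTED, so RE2 rejects it with "invalid escape sequence"; it is never treated as a Unicode escape.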
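
    If the patterns come from a source that uses the \uXXXX notation, the substitution described above can be automated before the pattern reaches RE2. The following is a minimal sketch using only the standard library. The helper name toRe2Escapes is hypothetical (it is not part of RE2), and it assumes every \u escape has exactly four hex digits and that the pattern contains no \Q...\E literal sections.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of RE2: rewrites each \uXXXX escape
// (exactly four hex digits) into the \x{XXXX} form listed in the RE2
// syntax document. Any other escape, including an escaped backslash,
// is copied through unchanged.
std::string toRe2Escapes(const std::string& pattern) {
    std::string out;
    out.reserve(pattern.size() + 8);
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        // Ordinary character, or a lone backslash at the very end.
        if (pattern[i] != '\\' || i + 1 >= pattern.size()) {
            out += pattern[i];
            continue;
        }
        const char next = pattern[i + 1];
        // A candidate needs 'u' plus four more characters after it.
        bool isUnicodeEscape = (next == 'u') && (i + 5 < pattern.size());
        for (std::size_t k = 2; isUnicodeEscape && k <= 5; ++k) {
            isUnicodeEscape =
                std::isxdigit(static_cast<unsigned char>(pattern[i + k])) != 0;
        }
        if (isUnicodeEscape) {
            out += "\\x{";
            out.append(pattern, i + 2, 4);  // the four hex digits
            out += '}';
            i += 5;  // skip 'u' and the four hex digits
        } else {
            out += pattern[i];
            out += next;
            ++i;  // skip the escaped character
        }
    }
    return out;
}
```

    With this sketch, toRe2Escapes(R"([^\u0020-\u007E\n])") yields [^\x{0020}-\x{007E}\n], and the result can be passed to the RE2 constructor in place of EXPR_FAILED_FILE. Separately, note that RE2::FullMatch succeeds only when the entire input matches the pattern, while RE2::PartialMatch looks for a match anywhere in the input, which may be closer to the intent of detecting invalid characters.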