When I use the boost::regex_search function like this:
std::string sTest = "18";
std::string sRegex = "^([\\u4E00-\\u9FA5]+)(\\d+)(#?)$";
std::string::const_iterator iterStart = sTest.begin();
std::string::const_iterator iterEnd = sTest.end();
boost::match_results<std::string::const_iterator> RegexResults;
while (boost::regex_search(iterStart, iterEnd, RegexResults, boost::regex(sRegex)))
{
    int a = 1;
    break;
}
the value of sTest is matched, which I did not expect. When I use std::regex_search instead, it works as expected.
Assuming the question is serious:
The regex matches:

- ^ (start of input)
- What you apparently intended as one or more characters from the "CJK Unified Ideographs" block (though only the range from the Unicode 1.0.1 standard).
  However, this is not what is parsed. Instead, it parses as regular hex escapes, which do match 1.
  The docs tell me that you might have wanted \x{dddd}, but that requires Unicode support.
  Digging through more docs tells me that there are two ways to use Boost.Regex with Unicode strings:
    1. Rely on wchar_t (the docs list a bunch of limitations and conditions).
    2. Use a Unicode Aware Regular Expression Type.
  May I suggest the latter.
- one or more numerical digits (according to the locale's character classification)
- (#?) might have been intended as a comment ((?#...)), but as spelled it optionally matches a single # character
- followed by $ (end of input)
That's not in your input, so it shouldn't match.
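To make the intended match concrete, here is a minimal hand-written sketch of what the pattern was presumably meant to accept, working on code points (char32_t) instead of bytes. It is not Boost.Regex code: Parts and parse_label are hypothetical names made up for this illustration, and it assumes that only ASCII 0-9 count as digits (a real \d may accept more, depending on the locale).

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: the three capture groups of the intended pattern.
struct Parts {
    std::u32string ideographs;  // group 1: code points in U+4E00..U+9FA5
    std::u32string digits;      // group 2: ASCII digits (simplifying assumption)
    bool has_hash = false;      // group 3: optional trailing '#'
};

// Hand-written equivalent of ^([\x{4E00}-\x{9FA5}]+)(\d+)(#?)$ on code points.
// Returns false and leaves `out` untouched when the input does not match.
bool parse_label(std::u32string const& in, Parts& out) {
    constexpr char32_t first = 0x4E00, last = 0x9FA5;
    std::size_t i = 0;

    while (i < in.size() && in[i] >= first && in[i] <= last) ++i;
    std::size_t const ideo_end = i;
    if (ideo_end == 0) return false;          // group 1 needs at least one

    while (i < in.size() && in[i] >= U'0' && in[i] <= U'9') ++i;
    std::size_t const digit_end = i;
    if (digit_end == ideo_end) return false;  // group 2 needs at least one

    bool const hash = i < in.size() && in[i] == U'#';
    if (hash) ++i;
    if (i != in.size()) return false;         // '$': nothing may follow

    out.ideographs = in.substr(0, ideo_end);
    out.digits = in.substr(ideo_end, digit_end - ideo_end);
    out.has_hash = hash;
    return true;
}
```

With this, parse_label(U"18", out) returns false because the leading ideograph group is empty, while an input such as U"\u4E2D\u6587" U"18#" splits into two ideographs, the digits 18 and a trailing #.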
Besides, since this is a fully anchored pattern (^...$), it would only make sense with regex_match, not regex_search.
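The difference between the two calls can be sketched as follows. This uses std::regex rather than Boost only to stay self-contained, on the assumption that the Boost functions of the same name draw the same distinction; the helper names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only when the WHOLE input matches the pattern.
bool whole_input_matches(std::string const& input, std::string const& pattern) {
    return std::regex_match(input, std::regex(pattern));
}

// Hypothetical helper: true when ANY part of the input matches the pattern.
bool some_part_matches(std::string const& input, std::string const& pattern) {
    return std::regex_search(input, std::regex(pattern));
}
```

For the input "x18y", some_part_matches with the unanchored pattern (\d+)(#?) succeeds, whereas whole_input_matches fails. Once the pattern is written as ^(\d+)(#?)$, the search can only succeed where a full match would, so regex_match states the intent more directly.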
The while loop is a weird idea, because the input never changes, so neither will the search result. If there's a match, the loop always breaks. Your while amounts to a more confusing if statement.
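If the loop was meant to visit every match, each search has to start after the previous match. A minimal sketch of that, again with std::regex instead of Boost just to keep it self-contained (find_all_numbers is a hypothetical helper, and the \d+ pattern is chosen only for illustration). It also compiles the regex once rather than on every iteration:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every run of digits found in `input`.
std::vector<std::string> find_all_numbers(std::string const& input) {
    static std::regex const digits(R"(\d+)");  // compiled once, not per iteration
    std::vector<std::string> found;
    std::smatch m;
    auto start = input.cbegin();

    // The loop makes progress because `start` moves past each match.
    while (std::regex_search(start, input.cend(), m, digits)) {
        found.push_back(m[0].str());
        start = m[0].second;
    }
    return found;
}
```

Calling find_all_numbers("a18b7#") yields the two strings 18 and 7. If only the first match matters, a plain if around a single regex_search call says the same thing as the original loop.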
Here's a program that codes three switchable approaches with three different regexes.
The approaches are BOOST_SIMPLE, BOOST_UNICODE and STANDARD_LIB.
The first regex is yours, the second uses \x{XXXX} escapes instead, and the third uses the named character class \p{InCJK_Unified_Ideographs}.
The resulting program:
#include <iostream>
#include <iomanip>
#include <string>
#include <string_view>

// Pick exactly one flavour with -DSTANDARD_LIB, -DBOOST_SIMPLE or -DBOOST_UNICODE
#ifdef STANDARD_LIB
    #include <regex>
    using match = std::smatch;
    using regex = std::regex;
    #define ctor regex
#elif defined(BOOST_SIMPLE)
    #include <boost/regex.hpp>
    using match = boost::smatch;
    using regex = boost::regex;
    #define ctor regex
#elif defined(BOOST_UNICODE)
    #include <boost/regex/icu.hpp>
    using match = boost::smatch;
    using regex = boost::u32regex;
    #define ctor boost::make_u32regex
    #define regex_match boost::u32regex_match
#else
    #error "Need to pick a flavour"
#endif

int main() {
    std::string const sTest = "18";

    struct testcase { std::string_view label, re; };

    for (auto current : { testcase
            { "wrong unicode character class",
              R"(^([\u4E00-\u9FA5]+)(\d+)(#?)$)" },
            { "correct unicode character class",
              R"(^([\x{4E00}-\x{9FA5}]+)(\d+)(#?)$)" },
            { "named character class",
              R"(^([\p{InCJK_Unified_Ideographs}]+)(\d+)(#?)$)" },
        }) {
        std::cout
            << std::string(current.label.length(), '=') << "\n"
            << current.label << " (" << std::quoted(current.re) << ")\n";

        try {
            match results;
            // The string literals are null-terminated, so .data() is safe here
            bool is_match = regex_match(sTest, results, ctor(current.re.data()));
            std::cout << "is_match: " << std::boolalpha << is_match << "\n";

            if (is_match) {
                std::cout << "$1: " << std::quoted(results[1].str()) << "\n";
                std::cout << "$2: " << std::quoted(results[2].str()) << "\n";
                std::cout << "$3: " << std::quoted(results[3].str()) << "\n";
            }
        } catch(std::exception const& e) {
            std::cerr << "failure: " << e.what() << "\n";
        }
    }
}
Compiling with
g++ -DBOOST_SIMPLE -O2 -std=c++17 main.cpp -lboost_regex -o boost_simple
g++ -DBOOST_UNICODE -O2 -std=c++17 main.cpp -lboost_regex -o boost_unicode -licuuc
g++ -DSTANDARD_LIB -O2 -std=c++17 main.cpp -o standard_lib
produces these outputs:
File boost_simple.log
=============================
wrong unicode character class ("^([\\u4E00-\\u9FA5]+)(\\d+)(#?)$")
is_match: true
$1: "1"
$2: "8"
$3: ""
===============================
correct unicode character class ("^([\\x{4E00}-\\x{9FA5}]+)(\\d+)(#?)$")
failure: Hexadecimal escape sequence was invalid. The error occurred while parsing the regular expression fragment: '^([>>>HERE>>>\x{4E00}-\'.
=====================
named character class ("^([\\p{InCJK_Unified_Ideographs}]+)(\\d+)(#?)$")
is_match: false
File boost_unicode.log
=============================
wrong unicode character class ("^([\\u4E00-\\u9FA5]+)(\\d+)(#?)$")
is_match: true
$1: "1"
$2: "8"
$3: ""
===============================
correct unicode character class ("^([\\x{4E00}-\\x{9FA5}]+)(\\d+)(#?)$")
is_match: false
=====================
named character class ("^([\\p{InCJK_Unified_Ideographs}]+)(\\d+)(#?)$")
is_match: false
File standard_lib.log
=============================
wrong unicode character class ("^([\\u4E00-\\u9FA5]+)(\\d+)(#?)$")
failure: Invalid range in bracket expression.
===============================
correct unicode character class ("^([\\x{4E00}-\\x{9FA5}]+)(\\d+)(#?)$")
failure: Unexpected end of regex when ascii character.
=====================
named character class ("^([\\p{InCJK_Unified_Ideographs}]+)(\\d+)(#?)$")
is_match: false