I'm using QRegularExpression in Qt 5.10.1 to extract sections of text from files that are delimited by a header and a footer. For example, consider the following text:
...
begin
some text
some more text
...
end
...
begin
etc.
I would then use the following regex to capture a section of text:
^begin\n([\s\S]+?)^end
Nothing out of the ordinary here. The problem is that if the section of text is very large (over 100k lines), the regex stops producing a match. I tried the same search in a text editor (TextPad) and it works fine, so I suspect the cause is some sort of MAX_SIZE constant in QRegularExpression or, more likely, in the PCRE2 library it uses. But I have no idea where to look, or whether this is something I can tweak. Or is this considered a bug?
Below is some code that can be used to demonstrate my issue. For me it bombs out at 100,000 lines (10,000,000 bytes).
#include <QDebug>
#include <QRegularExpression>
#include <QString>

int main()
{
    // Each line is exactly 100 characters long, including the trailing newline.
    QString s = "This line of text is exactly one hundred bytes long because it's a nice round number for this test.\n";
    QRegularExpression re(R"(^begin\n([\s\S]+?)^end)", QRegularExpression::MultilineOption);
    qDebug() << "start check:";
    // Grow the section by 1,000 lines per iteration until the lazy match fails.
    for (int i = 10000; i < 200000; i += 1000) {
        QString test = "begin\n" + s.repeated(i) + "end\n";
        QRegularExpressionMatch match = re.match(test);
        if (!match.hasMatch()) {
            qDebug() << "lazy match failed - trying greedy match";
            // Retry the same input with a greedy quantifier for comparison.
            re.setPattern(R"(^begin\n([\s\S]+)^end)");
            QRegularExpressionMatch greedyMatch = re.match(test);
            qDebug() << greedyMatch.hasMatch();
            break;
        }
        qDebug() << i;
    }
    return 0;
}
So it turns out that the PCRE2 library used by QRegularExpression has a MATCH_LIMIT setting that defaults to 10,000,000 (in the library's config.h file). Combined with the nature of 'lazy' matching, where advancing the search forward by one character counts as one step towards MATCH_LIMIT, this explains what I was seeing: 100,000 lines of 100 bytes each is exactly 10,000,000 characters. This is unfortunate, because I thought the performance of the lazy matching was very good in this example.
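To make the step counting concrete, here is a toy model in plain C++. It is only an illustration under simplifying assumptions and not how PCRE2 actually accounts for MATCH_LIMIT: the function name `lazyFindFooter` and the `stepLimit` parameter are hypothetical. It mimics a lazy quantifier by consuming one character, testing for the footer at the start of a line, and charging each advance against a budget.

```cpp
#include <cstddef>
#include <string>

// Toy model of a lazy scan such as [\s\S]+? followed by ^end.
// Every one-character advance costs one step; once the budget is spent the
// search gives up and reports "no match". This only loosely mimics
// MATCH_LIMIT and is NOT PCRE2's real implementation.
std::size_t lazyFindFooter(const std::string& text,
                           std::size_t bodyStart,
                           const std::string& footer,
                           std::size_t stepLimit)
{
    std::size_t steps = 0;
    // "+?" must consume at least one character before the footer is tried.
    for (std::size_t pos = bodyStart + 1; pos + footer.size() <= text.size(); ++pos) {
        if (++steps > stepLimit)
            return std::string::npos;   // budget exhausted
        const bool atLineStart = (text[pos - 1] == '\n');
        if (atLineStart && text.compare(pos, footer.size(), footer) == 0)
            return pos;                 // offset where the footer begins
    }
    return std::string::npos;           // ran out of input
}
```

In this model a body of N characters needs N steps, so a body of 100,000 lines at 100 bytes per line uses up a budget of 10,000,000.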
PCRE2 allows MATCH_LIMIT to be overridden for an individual search, but QRegularExpression does not expose this feature. I could patch the Qt library, or change the PCRE2 default and rebuild, but for now I've found an alternative (and much harder to understand) regex based on a good article here:
^begin\n((?:[^\n]++|\n(?!end))*+)\nend
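If the possessive pattern is too opaque to maintain, a regex-free line scan is another option that has no match limit at all. This is a different technique from the one used above, sketched in standard C++ with a hypothetical helper named `extractSections`. It assumes the header and footer each occupy a whole line, which is stricter than the regex (the regex also accepts a line that merely starts with "end"). In Qt code the same loop could presumably be written with QString and QTextStream::readLine.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Collects the text between each header line and the next footer line.
// Assumption: header and footer must match an entire line exactly.
std::vector<std::string> extractSections(const std::string& text,
                                         const std::string& header = "begin",
                                         const std::string& footer = "end")
{
    std::vector<std::string> sections;
    std::istringstream in(text);
    std::string line;
    std::string current;
    bool inside = false;

    while (std::getline(in, line)) {
        if (!inside) {
            if (line == header) {       // a new section starts here
                inside = true;
                current.clear();
            }
        } else if (line == footer) {    // the section is complete
            sections.push_back(current);
            inside = false;
        } else {                        // body line: keep it with its newline
            current += line;
            current += '\n';
        }
    }
    return sections;                    // an unterminated final section is dropped
}
```

The cost is linear in the input size and each captured section keeps its trailing newline, unlike the capture group in the regex above.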