
QRegularExpression lazy matching not working for very large strings


I'm using QRegularExpression in Qt 5.10.1 to extract sections of text from files that are bound by a header and footer. For example, consider the following text:

...
begin
    some text
    some more text
    ...
end
...
begin
    etc.

I would then use the following regex to capture a section of text:

^begin\n([\s\S]+?)^end

Nothing out of the ordinary here. The problem is that when the section of text is very large (over 100k lines), the regex stops producing a match. I tried the same search in a text editor (TextPad) and it works fine, so I suspect the cause is some sort of MAX_SIZE constant in QRegularExpression or, more likely, in the PCRE2 library it uses. But I have no idea where to look, or whether this is something I can tweak. Or is this considered a bug?

Below is some code that can be used to demonstrate my issue. For me it bombs out at 100,000 lines (10,000,000 bytes).

// Each line is exactly 100 bytes, including the trailing newline.
QString s = "This line of text is exactly one hundred bytes long because it's a nice round number for this test.\n";
QRegularExpression re(R"(^begin\n([\s\S]+?)^end)", QRegularExpression::MultilineOption);
qDebug() << "start check:";
for (int i = 10000; i < 200000; i += 1000) {
    QString test = "begin\n" + s.repeated(i) + "end\n";
    QRegularExpressionMatch match = re.match(test);
    if (!match.hasMatch()) {
        // Retry the same subject with a greedy quantifier for comparison.
        qDebug() << "lazy match failed - trying greedy match";
        re.setPattern(R"(^begin\n([\s\S]+)^end)");
        QRegularExpressionMatch greedyMatch = re.match(test);
        qDebug() << greedyMatch.hasMatch();
        break;
    }
    qDebug() << i;  // number of lines that still matched
}

Solution

  • It turns out that the PCRE2 library used by QRegularExpression has a MATCH_LIMIT setting that defaults to 10,000,000 (set in the library's config.h file). Combined with the nature of 'lazy' matching (each one-character advance of the search counts towards MATCH_LIMIT), this explains what I was seeing: 100,000 lines of 100 bytes each is 10,000,000 bytes, which is exactly where the match started failing. This is unfortunate, because I thought the performance of lazy matching was very good in this example.

    PCRE2 allows MATCH_LIMIT to be overridden for an individual search, but QRegularExpression does not expose this feature. I could patch the Qt library, or change the PCRE2 default and rebuild, but for now I've found an alternative (and much harder to understand) regex, based on a good article here:

    ^begin\n((?:[^\n]++|\n(?!end))*+)\nend
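
Since the limit described above belongs to the regex engine, another way to sidestep it is to skip the regex for this particular task and scan the text line by line. The following is only a minimal sketch of that different technique, not the approach used in the answer: `extractSections` is a hypothetical helper name, and it uses `std::string` rather than `QString` so that it stays self-contained. It assumes the header is a line consisting solely of `begin` and the footer is any line starting with `end`, which roughly mirrors the original lazy pattern.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Qt or PCRE2). Returns the body of every
// section that starts with a line equal to "begin" and ends at the next line
// that starts with "end". A section with no footer is dropped.
std::vector<std::string> extractSections(const std::string& text)
{
    std::vector<std::string> sections;
    std::string current;
    bool inside = false;
    std::size_t pos = 0;

    while (pos < text.size()) {
        // Locate the end of the current line; the '\n' itself is excluded.
        std::size_t eol = text.find('\n', pos);
        const bool hasNewline = (eol != std::string::npos);
        if (!hasNewline) {
            eol = text.size();
        }
        const std::size_t len = eol - pos;

        if (!inside) {
            // The header must be the whole line, as in ^begin\n.
            if (hasNewline && len == 5 && text.compare(pos, 5, "begin") == 0) {
                inside = true;
                current.clear();
            }
        } else if (len >= 3 && text.compare(pos, 3, "end") == 0) {
            // The footer only has to start the line, as in ^end.
            sections.push_back(std::move(current));
            current.clear();
            inside = false;
        } else {
            // Body line: keep it, restoring the newline that was excluded.
            current.append(text, pos, len);
            current.push_back('\n');
        }
        pos = eol + 1;
    }
    return sections;
}
```

Each line is visited once, so the cost grows linearly with the input and no match limit is involved. Unlike `[\s\S]+?`, this sketch also returns an empty string for an empty section, and a body line that happens to start with `end` closes the section early, just as it would with the original lazy pattern.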