c++regex c++11 stack-overflow standard-library

Regular Expression causing Stack Overflow

Further to my previous question: ECMAScript Regex for a multilined string, I have implemented the following loading procedure:

void Load( const std::string& szFileName )
{
     static const std::regex regexObject( "=== ([^=]+) ===\\n((?:.|\\n)*)\\n=== END \\1 ===", std::regex_constants::ECMAScript | std::regex_constants::optimize );
     static const std::regex regexData( "<([^>]+)>:([^<]*)\\n", std::regex_constants::ECMAScript | std::regex_constants::optimize );

     std::ifstream inFile( szFileName );
     inFile.exceptions( std::ifstream::badbit );

     std::string szFileData( (std::istreambuf_iterator<char>(inFile)), (std::istreambuf_iterator<char>()) );

     inFile.close();

     std::vector<std::future<void>> vecFutures;

     for( std::sregex_iterator itObject( szFileData.cbegin(), szFileData.cend(), regexObject ), end; itObject != end; ++itObject )
     {
          if( (*itObject)[1] == "OBJECT1" )
          {
               vecFutures.emplace_back( std::async( []( std::string szDataString ) {
                    for( std::sregex_iterator itData( szDataString.cbegin(), szDataString.cend(), regexData ) { // Do Stuff }
               }, (*itObject)[2].str() ) );
          }
          else if( (*itObject)[1] == "OBJECT2" )
          {
               vecFutures.emplace_back( std::async( []( std::string szDataString ) {
                    for( std::sregex_iterator itData( szDataString.cbegin(), szDataString.cend(), regexData ) { // Do Stuff }
               }, (*itObject)[2].str() ) );
          }
     }

     for( auto& future : vecFutures )
     {
          future.get();
     }
}

However, loading it with this file results in a Stack Overflow (parameters: 0x00000001, 0x00332FE4):

=== OBJECT2 ===
<Name>:Test Manufacturer
<Supplier>:Test Supplier
<Address>:Test Multiline
Contact
Address
<Email>:[email protected]
<Telephone Number>:0123456789
=== END OBJECT2 ===
=== OBJECT1 ===
<Number>:1
<Name>:Test
<Location>:Here
<Manufacturer>:
<Model Number>:12345
<Serial Number>:54321
<Owner>:Me
<IP Address>:0.0.0.0
=== END OBJECT1 ===

I have been unable to find the source of the Stack Overflow but it looks like the outer std::sregex_iterator loop is responsible.

Thanks in advance!

Solution

Here's another attempt:

=== ([^=]+) ===\n((?:(?!===)[^\n]+\n)+)=== END \1 ===

In your C++ it would obviously be written as:

=== ([^=]+) ===\\n((?:(?!===)[^\\n]+\\n)+)=== END \\1 ===

It's made for minimal backtracking (at least when matching), although I'm a bit Mr. Tired-Face at the moment, so probably missed quite a few ways to improve it.

It makes two assumptions, which are used to avoid a lot of backtracking (that possibly causes the stack overflow, as others have said):

That there's never a === at the start of a line, except for the start/end marker lines.
That C++ supports these regex features - specifically the use of a negative lookahead (?!). It should, considering it's ECMAScript dialect.

Explained:

=== ([^=]+) ===\n

Match and capture the object start marker. The [^=] is one way to avoid a relatively small amount of backtracking here, same as yours - we're not using [^ ], because I do not know if there may be spaces in the OBJECT id.

((?:

Start capturing group for data. Inside it, a non-capturing group, because we're going to match each line individually.

   (?!===)

Negative lookahead - we don't want === at the start of our captured line.

   [^\n]+\n

Matches one line individually.

)+)

Match at least one line between start and end markers, then capture ALL the lines in a single group.

=== END \1 ===

Match the end marker.

Comparison (using RegexBuddy):

Original version:

First match: 1277 steps
Failed match: 1 step (this is due to the line break between the objects)
Second match: 396 steps

Every added object will cause the amount of steps to grow for the previous ones. E.g., adding one more object (copy of object 2, renamed to 3) will result in: 2203 steps, 1322 steps, 425 steps.

This version:

First match: 67 steps
Failed match: 1 step (once again due to the line break between the objects)
Second match: 72 steps
Failed match: 1 step
Third match: 67 steps