Further to my previous question: ECMAScript Regex for a multilined string, I have implemented the following loading procedure:
void Load( const std::string& szFileName )
{
static const std::regex regexObject( "=== ([^=]+) ===\\n((?:.|\\n)*)\\n=== END \\1 ===", std::regex_constants::ECMAScript | std::regex_constants::optimize );
static const std::regex regexData( "<([^>]+)>:([^<]*)\\n", std::regex_constants::ECMAScript | std::regex_constants::optimize );
std::ifstream inFile( szFileName );
inFile.exceptions( std::ifstream::badbit );
std::string szFileData( (std::istreambuf_iterator<char>(inFile)), (std::istreambuf_iterator<char>()) );
inFile.close();
std::vector<std::future<void>> vecFutures;
for( std::sregex_iterator itObject( szFileData.cbegin(), szFileData.cend(), regexObject ), end; itObject != end; ++itObject )
{
if( (*itObject)[1] == "OBJECT1" )
{
vecFutures.emplace_back( std::async( []( std::string szDataString ) {
for( std::sregex_iterator itData( szDataString.cbegin(), szDataString.cend(), regexData ) { // Do Stuff }
}, (*itObject)[2].str() ) );
}
else if( (*itObject)[1] == "OBJECT2" )
{
vecFutures.emplace_back( std::async( []( std::string szDataString ) {
for( std::sregex_iterator itData( szDataString.cbegin(), szDataString.cend(), regexData ) { // Do Stuff }
}, (*itObject)[2].str() ) );
}
}
for( auto& future : vecFutures )
{
future.get();
}
}
However, loading it with this file results in a Stack Overflow (parameters: 0x00000001, 0x00332FE4):
=== OBJECT2 ===
<Name>:Test Manufacturer
<Supplier>:Test Supplier
<Address>:Test Multiline
Contact
Address
<Email>:[email protected]
<Telephone Number>:0123456789
=== END OBJECT2 ===
=== OBJECT1 ===
<Number>:1
<Name>:Test
<Location>:Here
<Manufacturer>:
<Model Number>:12345
<Serial Number>:54321
<Owner>:Me
<IP Address>:0.0.0.0
=== END OBJECT1 ===
I have been unable to find the source of the Stack Overflow but it looks like the outer std::sregex_iterator
loop is responsible.
Thanks in advance!
Here's another attempt:
=== ([^=]+) ===\n((?:(?!===)[^\n]+\n)+)=== END \1 ===
In your C++ it would obviously be written as:
=== ([^=]+) ===\\n((?:(?!===)[^\\n]+\\n)+)=== END \\1 ===
It's made for minimal backtracking (at least when matching), although I'm a bit Mr. Tired-Face at the moment, so probably missed quite a few ways to improve it.
It makes two assumptions, which are used to avoid a lot of backtracking (that possibly causes the stack overflow, as others have said):
===
at the start of a line, except for the start/end marker lines.?!
). It should, considering it's ECMAScript dialect.Explained:
=== ([^=]+) ===\n
Match and capture the object start marker. The [^=]
is one way to avoid a relatively small amount of backtracking here, same as yours - we're not using [^ ]
, because I do not know if there may be spaces in the OBJECT id.
((?:
Start capturing group for data. Inside it, a non-capturing group, because we're going to match each line individually.
(?!===)
Negative lookahead - we don't want ===
at the start of our captured line.
[^\n]+\n
Matches one line individually.
)+)
Match at least one line between start and end markers, then capture ALL the lines in a single group.
=== END \1 ===
Match the end marker.
Comparison (using RegexBuddy):
Original version:
Every added object will cause the amount of steps to grow for the previous ones. E.g., adding one more object (copy of object 2, renamed to 3) will result in: 2203 steps, 1322 steps, 425 steps.
This version: