Search code examples
regexregex-negationboost-xpressive

Regular expression causing segfault/stack overflow


(or so I think)...

I'm using boost::xpressive as my regular expression engine to parse something and I get a segfault. I suspect that recursivity and my bad regular expression are to blame, because gdb shows more than 300 stack frames. So, here is my (case-sensitive) regular expression, in perl/python notation:

begin([^e]+)e((?:[^b]|b(?!egin))+)

which I expect to match

beginHEADER HEREeFOLLOWED BY SOME LONG LONG TEXT THAT GOES UNTIL NEXTbegin

with the first uppercase text (HEADER HERE) in the first group, and the second uppercase text in the second group. I always get the segfault if the text that should match group 2 is very long.

Why shouldn't this work?


Solution

  • You can simplify your regular expression a lot by simply using non-greedy matching:

    begin(.+?)e(.+?)begin
    

    Try that and see if it works better.

    It's likely that your original regex was causing a stack overflow due to a recursive implementation of the |-or grouping in your regex, which was potentially branching with every single character in your second group.

    A simple non-greedy .+? on the other hand doesn't need to branch for every single character.