c++ regex pcre multiline pcregrep

Why does this regular expression match in pcregrep but not within my C++ code?


I have a regex that works perfectly with pcregrep:

pcregrep -M '([a-zA-Z0-9_&*]+)(\(+)([a-zA-Z0-9_ &\*]+)(\)+)(\n)(\{)'

Now I tried to include this regex in my C++ code but it does not match (escapes included):

char const *regex = "([a-zA-Z0-9_&*]+)\\(+([a-zA-Z0-9_ &\\*]+)\\)+(?>\n+)\\{+";
re = pcre_compile(regex, PCRE_MULTILINE, &error, &erroffset, 0);

I'm trying to find function bodies like this (the line break is 0a in hex):

my_function(char *str)
{

Why does it work with pcregrep and not within the C++ code?


Solution

  • Your first regex:

     ( [a-zA-Z0-9_&*]+ )           # (1)
     ( \(+ )                       # (2)
     ( [a-zA-Z0-9_ &\*]+ )         # (3)
     ( \)+ )                       # (4)
     ( \n )                        # (5)
     ( \{ )                        # (6)
    

    Your second regex:

     ( [a-zA-Z0-9_&*]+ )           # (1)
     \(+
     ( [a-zA-Z0-9_ &\*]+ )         # (2)
     \)+
     (?> \n+ )
     \{+
    

    Apart from the different capture groups and an unnecessary atomic group (?>),
    there is one obvious difference:

    The last newline and curly brace in the second regex have + quantifiers.
    But + means one or more, so anything the first regex matches, the second should match too.

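    The less obvious difference is the input: it is unknown whether the files were opened in translated (text) mode or not.
    If they were read untranslated and contain CRLF line endings, a \r sits between the ) and the \n, and a bare \n cannot match there.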

    You can usually cover all cases with \r?\n in place of \n
    (or even (?:\r?\n|\r) ).

    So, if you want to quantify the linebreak, it would be (?:\r?\n)+ or (?:\r?\n|\r)+.

    The other option might be to try the linebreak construct (I think it's \R)
    instead (available in newer versions of PCRE).

    If that doesn't work, it's something else.
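
To check the line-ending theory, here is a minimal sketch. It uses std::regex (ECMAScript syntax) instead of PCRE so that it compiles without extra libraries. That is a swap made for illustration, not what the question's code does. The function names are hypothetical, and the atomic group and \R are left out because std::regex does not support them.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reports whether the buffer contains any carriage
// return, which would suggest CRLF line endings read in untranslated mode.
bool has_carriage_return(const std::string& text) {
    return text.find('\r') != std::string::npos;
}

// Strict pattern: requires a bare LF between ')' and '{',
// as in the regex from the question.
bool matches_lf_only(const std::string& text) {
    static const std::regex re(
        R"re(([a-zA-Z0-9_&*]+)\(+([a-zA-Z0-9_ &*]+)\)+\n+\{+)re");
    return std::regex_search(text, re);
}

// Tolerant pattern: accepts LF or CRLF between ')' and '{',
// using the (?:\r?\n)+ form suggested in the answer.
bool matches_any_eol(const std::string& text) {
    static const std::regex re(
        R"re(([a-zA-Z0-9_&*]+)\(+([a-zA-Z0-9_ &*]+)\)+(?:\r?\n)+\{+)re");
    return std::regex_search(text, re);
}
```

With the subject "my_function(char *str)\r\n{", the strict pattern should fail and the tolerant one should succeed. With a plain \n, both succeed. If has_carriage_return reports false for the real input and the match still fails, line endings are probably not the cause.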