php regex regex-negation regex-lookarounds

Find a word in multiline comment with one regex


I need a regex that matches a specific capturing group which falls inside a multiline comment /* ... */.

In particular, I need to find PHP variable definitions inside multiline comments.

For example:

/* other code $var = value1 */
$var = value2 ;

/* 
other code
$var = value3 ;
other code
*/

The regex must match only the two occurrences of '$var =' inside the comments, not the one outside the comment.

For the above example I wrote a regex that uses an unrestricted lookbehind, like this:

(?<=[/][\*][^/]+)(\$var) | (?<=[/][\*][^\*]+)(\$var)

But this regex fails whenever both characters * and / appear between the comment opening tag '/*' and $var, even if they are apart from one another, which is not the desired behaviour.

For example, it fails in this case:

$var = .... ;

/* 
other * code /
$var = .... ;
other code
*/

because it finds both '*' and '/', even though together they do not form the comment closing tag.

The key point is that I cannot negate a token which is a combination of two characters; I can only negate them one by one: [^*] or [^/].

Furthermore, I cannot use the token [\s\S] instead of [^/] and [^*], because it would also select a $var outside a comment whenever an earlier comment block precedes it.

Any ideas? Is it even possible to achieve this with a normal regex? Or would I need something different?


Solution

  • One idea is to use \G to glue matches to /*

    (?:/\*|\G(?!^))(?:(?!\*/)[^$])*\K\$var\s*=\s*(?:(?!\*/)[^$;])*
    

    This might be hard to follow if you don't work with regexes a lot. See regex101 for a demo.

    \G can be seen as "glue": it matches at the position where the previous match ended. But \G also matches at the start of the string. That's why the negative lookahead is used in \G(?!^): \G is only needed to continue a chain of matches, not to start one at the beginning of the input.
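The effect of (?!^) is easiest to see on a toy input. Here is a minimal PHP sketch; the subject string is made up for illustration, and "x" is only a stand-in for /*:

```php
<?php
// Made-up subject: digits should only count when "glued" to a leading "x".
$subject = '12x34 5';

// Without (?!^): \G also matches at offset 0, so the leading "12" is picked up.
preg_match_all('~(?:x|\G)\d~', $subject, $loose);

// With (?!^): a chain can only start at "x" and then continue
// from the end of the previous match.
preg_match_all('~(?:x|\G(?!^))\d~', $subject, $glued);

print_r($loose[0]); // 1, 2, x3, 4
print_r($glued[0]); // x3, 4
```

The trailing "5" is not matched in either case, because the space before it breaks the chain and there is no new "x" to restart it.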

    • /\*|\G(?!^) This part either starts a match at /* or continues from the end of the previous match.

    • (?:(?!\*/)[^$])* Matches any amount of characters that are not $ (negated class), as long as the comment does not end there (?!\*/). This covers the text before and between occurrences of $var.

    • \K\$var \K resets the beginning of the reported match just before $var. \K can be useful as an alternative to a variable-width lookbehind, which is not available in PCRE.

    • \s*=\s*(?:(?!\*/)[^$;])* Matches the value of the variable. This is far from perfect and would need modification for quoted values, or if it does not suit your input. After =, it matches [^$;] characters (neither dollar nor semicolon) as long as there is no */ ahead (?!\*/).

    This regex does not check whether there is actually a comment end */; it just binds matches to /*.
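To try the \G pattern from PHP, something like the following sketch should work. The sample input combines the two examples from the question; the "~" delimiter and the variable names are my own choices, and the trim is there because the value part can pick up trailing whitespace:

```php
<?php
// Sample input based on the question: only assignments inside /* ... */ should match.
$code = <<<'SRC'
/* other code $var = value1 */
$var = value2 ;

/*
other * code /
$var = value3 ;
other code
*/
SRC;

// "~" is used as the delimiter so that "/" needs no escaping.
$pattern = '~(?:/\*|\G(?!^))(?:(?!\*/)[^$])*\K\$var\s*=\s*(?:(?!\*/)[^$;])*~';

preg_match_all($pattern, $code, $m);

// The value part may end in whitespace, so trim each reported match.
$found = array_map('trim', $m[0]);
print_r($found); // $var = value1, $var = value3
```

One consequence of using [^$] for the skipped text: any other variable (say $foo) between /* and $var breaks the chain, so that $var is not found. If your comments contain other variables, the skipping part would need to be loosened.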
    Another idea would be to use a variant of this trick with the verbs (*SKIP)(*FAIL), as in this demo.
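The pattern from the linked demo is not reproduced here, so the following is only my own sketch of how a (*SKIP)(*FAIL) approach could look, not necessarily what the demo does. The idea is to consume everything outside a comment and throw it away, so that whatever remains for the second branch lies inside a comment. The sample input and the (\w+) capture for the value are assumptions for illustration:

```php
<?php
// Made-up input: value1 and value3 are inside comments, the rest are not.
$code = <<<'SRC'
$var = value0 ;
/* other code $var = value1 */
$var = value2 ;

/*
other * code /
$var = value3 ;
*/
$var = value4 ;
SRC;

// Left branch: from the start of the input or from a closing */, consume
// everything up to the next /* (or the end), then discard it with (*SKIP)(*F).
// Right branch: only reachable inside a comment, so match the assignment there.
$pattern = '~(?:\A|\*/)(?:(?!/\*).)*+(*SKIP)(*F)|\$var\s*=\s*(\w+)~s';

preg_match_all($pattern, $code, $m);
print_r($m[1]); // value1, value3
```

Like the \G version, this does not verify that an opened comment is ever closed, and it knows nothing about /* or */ appearing inside string literals.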