I need a regex that matches a specific capturing group only when it falls inside a multiline comment /* ... */.
In particular, I need to find PHP variable definitions inside multiline comments.
For example:
/* other code $var = value1 */
$var = value2 ;
/*
other code
$var = value3 ;
other code
*/
The regex must match only the two occurrences of '$var =' inside the comments, not the one outside the comments.
For the above example I wrote a regex that uses unrestricted lookbehind, like this:
(?<=[/][\*][^/]+)(\$var) | (?<=[/][\*][^\*]+)(\$var)
But this regex fails whenever both characters * and / appear between the comment opening tag '/*' and $var, even if they are APART from one another, which is not the desired behaviour.
For example, it fails in this case:
$var = .... ;
/*
other * code /
$var = .... ;
other code
*/
because it finds both '*' and '/', even though together they do not form the comment closing tag.
The key point is that I cannot negate a token which is a combination of two characters; I can only negate them one by one: [^*] or [^/].
Furthermore, I cannot use the token [\s\S] instead of [^/] and [^*], because it would also select a $var outside a comment whenever an earlier comment block precedes it.
Any ideas? Is it even possible to achieve this with a normal regex? Or would I need something different?
One idea is to use \G to glue matches to /*:
(?:/\*|\G(?!^))(?:(?!\*/)[^$])*\K\$var\s*=\s*(?:(?!\*/)[^$;])*
Might be hard to understand if you aren't doing a lot with regexes. See regex101 for demo.
\G can be seen as "glue": it continues at the end of the previous match. But \G also matches at the start of the string. That's why the negative lookahead is used in \G(?!^), since we only need it to continue.
/\*|\G(?!^)
This part finds the beginning of a match at /* or continues matching.
(?:(?!\*/)[^$])*
Matches any amount of characters that are not $ (negated class) while not ending the comment, which is what (?!\*/) checks. This covers the stuff before/between $var.
\K\$var
\K resets the beginning of the reported match just before $var occurs. \K can be useful as an alternative to a variable-width lookbehind, which is not available in PCRE.
\s*=\s*(?:(?!\*/)[^$;])*
This matches the value of the variable. It is far from perfect and would need modification for quoted values or if it is not convenient for your input. After = it matches [^$;], that is, characters that are not a dollar sign or a semicolon, for as long as the lookahead (?!\*/) sees no */ ahead.
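To make the gluing idea more concrete, here is a rough sketch in Python. Python's built-in re module supports neither \G nor \K, so this only emulates the behaviour: Pattern.match(text, pos) anchors an attempt at a given offset, much like \G, and a capturing group stands in for \K. The function name find_glued and the hard-coded $var are illustrative assumptions, not part of the regex above.

```python
import re

# Opening tag of a comment: the "/\*" branch of the alternation.
OPEN = re.compile(r"/\*")

# Everything after the alternation. Group 1 plays the role of \K:
# it holds only "$var = value", not the skipped text before it.
STEP = re.compile(r"(?:(?!\*/)[^$])*(\$var\s*=\s*(?:(?!\*/)[^$;])*)")


def find_glued(text):
    """Return the '$var = value' pieces found after a '/*'."""
    results = []
    pos = 0
    while True:
        opening = OPEN.search(text, pos)
        if opening is None:
            break
        pos = opening.end()
        # Emulated \G(?!^): keep matching exactly where the last match ended.
        while True:
            step = STEP.match(text, pos)
            if step is None:
                break
            results.append(step.group(1))
            pos = step.end()
    return results
```

The returned strings keep any trailing whitespace that the value part consumed, so callers may want to strip them.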
This regex does not check whether there is actually a comment end */; it just binds matches to /*.
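If the closing */ has to be verified as well, a different technique from the single regex above is to work in two passes: first locate complete comments, then search only inside each one. The following is a minimal sketch in Python under some assumptions: the name assignments_in_comments is hypothetical, $var is hard-coded, and a /* that appears inside a string literal would be mistaken for a comment.

```python
import re

# A complete comment: "/*", then as little as possible, then "*/".
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

# The assignment we are looking for inside comments.
ASSIGNMENT = re.compile(r"\$var\s*=")


def assignments_in_comments(source):
    """Return the offsets of '$var =' that lie inside closed comments."""
    offsets = []
    for comment in COMMENT.finditer(source):
        # Restrict the second search to the span of this comment.
        for hit in ASSIGNMENT.finditer(source, comment.start(), comment.end()):
            offsets.append(hit.start())
    return offsets
```

Because the second pass is restricted to the span of each comment, no lookbehind is needed, and an unterminated /* yields no matches at all.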
Another idea would be to use this kind of trick with the verbs (*SKIP)(*FAIL), like in this demo.