regex pcre

Why is this negative lookahead not working?


I have this regex that is supposed to help me find and replace deprecated mysql queries. For some reason, though, once I replace one query, it recaptures the same area and merely extends the match to the end of the next deprecated query. I'm trying to solve this by not letting it match a specific keyword that appears in the replacement string (stmt), but it's ignoring this constraint for some reason.

(?mis)\$(?<sql>[a-z0-9]*) = (?<query>"select.*?\;)(?:(?!stmt).*?)\$(?<res>[a-z0-9]*?) = mysql_query.*?\;(?<txt>.*?)while\s?\(\$(?<row>[a-z0-9]*?) = mysql.*?\{\  [1]

Here is the Regex101 I'm using to debug.

(?:(?!stmt).*?) is the lookahead in question. I want it to allow for an arbitrary amount of text in between the named capture groups before and after. [2] The *? should already be forcing it to find the smallest section. In my test input there is a perfectly acceptable match starting on line 14 ($sql = "SELECT admin from user where id=" . $userID;), but it is insisting on starting all the way at the top with the old, already replaced match.

Why is my negative lookahead not working the way I think it should be working? [3]




1. I'm using (?mis) because PHPStorm doesn't play nice with normal flags.
2. To prevent random code and bad formatting from getting in the way of the pattern.
3. If this is an XY problem and I should be forcing the correct match a different way, I'll welcome that as an answer instead.


Solution

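  • The negative lookahead isn't rejecting the match because the condition it tests never occurs. It only denies a match when the semicolon at the end of <query> is immediately followed by "stmt", which isn't the case in your code: the semicolon is followed by a newline, whitespace, a dollar sign, and only then "stmt". The lookahead is checked once, at that single position, and the .*? after it is then free to match anything, including "stmt".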

    You can fix that part by extending the negative lookahead to (?!\s*\$stmt), but then a second problem becomes evident: the regex backtracks and extends the <query> match to the next semicolon that isn't followed by $stmt. You fix that by tightening <query> so that it matches greedily on non-semicolons rather than non-greedily on anything. That is, (?<query>"select.*?\;) becomes (?<query>"select[^;]*\;). This creates a dead stop to the match at the first semicolon.

    This will fail to match if you have any semicolons inside your SQL, but hey.

    Does that get the desired result?
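
If it helps to see the three stages side by side, here is a minimal sketch that reproduces them. It makes several assumptions: it uses Python's re module rather than PCRE (so named groups are spelled (?P<name>...) instead of (?<name>...)), the PHP snippet is a made-up stand-in for your file, the tail of your pattern after mysql_query is trimmed off for brevity, and HEAD, TAIL, PATTERNS, first_match and start_line are illustrative names of my own.

```python
import re

# Hypothetical PHP input: lines 1-3 are an already converted query,
# lines 4-6 are a query that still uses the deprecated mysql_query call.
text = "\n".join([
    '$sql = "SELECT name from user where id=" . $userID;',
    '$stmt = $db->prepare($sql);',
    '$stmt->execute();',
    '$sql = "SELECT admin from user where id=" . $userID;',
    '$res = mysql_query($sql);',
    'while ($row = mysql_fetch_assoc($res)) {',
])

# Shared start and (trimmed) end of the pattern from the question.
HEAD = r'\$(?P<sql>[a-z0-9]*) = '
TAIL = r'\$(?P<res>[a-z0-9]*?) = mysql_query'

PATTERNS = {
    # Lookahead only checks for "stmt" directly after the semicolon.
    "original": HEAD + r'(?P<query>"select.*?;)(?:(?!stmt).*?)' + TAIL,
    # Lookahead now sees "$stmt", but the lazy .*? can backtrack past it.
    "extended": HEAD + r'(?P<query>"select.*?;)(?:(?!\s*\$stmt).*?)' + TAIL,
    # [^;]* cannot cross a semicolon, so a rejected start stays rejected.
    "fixed": HEAD + r'(?P<query>"select[^;]*;)(?:(?!\s*\$stmt).*?)' + TAIL,
}


def first_match(name):
    """Return the first match of the named pattern (case-insensitive, dot matches newline)."""
    return re.search(PATTERNS[name], text, flags=re.IGNORECASE | re.DOTALL)


def start_line(match):
    """1-based line number on which the match starts."""
    return text.count("\n", 0, match.start()) + 1


for name in PATTERNS:
    m = first_match(name)
    statements = m.group("query").count(";")
    print(f"{name}: starts on line {start_line(m)}, <query> covers {statements} statement(s)")
```

With this sample, the original and extended patterns both start on line 1 (the extended one just stretches <query> over three statements), while the fixed pattern starts on line 4 at the query that still needs replacing.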