php regex reluctant-quantifiers

Using lazy evaluation on a large regular expression (and not just .*?)


Using the following regex:

\[\w* \w* \d{2} [\w:]* \d{4}\] \[error\] \[client .*?\] .*? Using HTTP not .*?<br /> 

I get the following results (where yellow boxes indicate a match):

[Screenshot: Sublime Text 2 search results with the matches highlighted]

Raw Text: http://pastebin.com/vSi0mLGv

The bottom two sections are correct. I want all sections that contain: <<<NOTICE>>> Non-Prod Server: Using HTTP not HTTP/S

The top section, however, contains the correct string (similar to the bottom two) but also comes with a whole other chunk that I do not want:

[Thu May 10 17:43:48 2012] [error] [client ::1] Current Name:
DashboardBar_projAnnualReview200, referer: http://
localhost/test/pages/TestPage.php<br />

I know this comes down to regex being greedy, but how can I make it do a lazy evaluation for the <br />, if that's even the right way to go about it? I've tried (<br />)*? and other variations to no avail.


Other information: I am using Sublime Text 2 and performing a regex search, in case anyone wants to recreate the image.


Solution

  • Greediness is not the problem; eagerness is. The regex engine starts trying to match at the earliest possible position, and it doesn't give up on that position until every possibility has been exhausted. Making quantifiers non-greedy doesn't change that; it only changes the order in which the possibilities are tried.

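    It's not the * in .* that's causing your problem; it's the . itself. Because . can match brackets as well as anything else, a match that begins at an earlier entry's timestamp is free to run on into the following entries until it finds Using HTTP not. You need to use something more restrictive. This regex works as desired because I've replaced the first two .*? with [^][]*, which matches any run of characters other than ] and [, so the match can no longer cross into the next bracketed entry: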

    \[\w* \w* \d{2} [\w:]* \d{4}\] \[error\] \[client [^][]*\] [^][]* Using HTTP not .*?<br />
    

    I don't know what regex flavor Sublime Text uses, so you may need to escape the square brackets inside the character class:

    \[\w* \w* \d{2} [\w:]* \d{4}\] \[error\] \[client [^\]\[]*\] [^\]\[]* Using HTTP not .*?<br />
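
To see the difference outside the editor, here is a minimal sketch in Python's re module (not Sublime Text's engine, so treat it as an illustration only). The two log entries are a hypothetical sample modelled on the excerpt in the question, not the real pastebin data, and the names ENTRY_1, ENTRY_2, LAZY and RESTRICTED are made up for this example. The entries are joined on one line so that no assumption is needed about whether . matches newlines.

```python
import re

# Hypothetical sample: one unwanted entry followed by one wanted entry,
# joined on a single line so that "." never has to cross a newline.
ENTRY_1 = ("[Thu May 10 17:43:48 2012] [error] [client ::1] Current Name: "
           "DashboardBar_projAnnualReview200, referer: "
           "http://localhost/test/pages/TestPage.php<br />")
ENTRY_2 = ("[Thu May 10 17:43:49 2012] [error] [client ::1] <<<NOTICE>>> "
           "Non-Prod Server: Using HTTP not HTTP/S<br />")
LOG = ENTRY_1 + ENTRY_2

# Timestamp and [error] prefix, shared by both patterns.
PREFIX = r"\[\w* \w* \d{2} [\w:]* \d{4}\] \[error\] "

# Pattern from the question: every wildcard is a lazy ".*?".
LAZY = re.compile(PREFIX + r"\[client .*?\] .*? Using HTTP not .*?<br />")

# Pattern from the answer (escaped-bracket form): the first two wildcards
# may not contain "[" or "]".
RESTRICTED = re.compile(
    PREFIX + r"\[client [^\]\[]*\] [^\]\[]* Using HTTP not .*?<br />"
)

lazy_matches = LAZY.findall(LOG)
restricted_matches = RESTRICTED.findall(LOG)

# The lazy pattern still starts at the first timestamp and swallows both
# entries in a single match.
print(len(lazy_matches), lazy_matches[0] == LOG)

# The restricted pattern cannot cross the "[" that opens the second entry,
# so it matches only the entry that contains the notice.
print(restricted_matches == [ENTRY_2])
```

The lazy pattern returns a single match covering both entries, because the engine commits to the earliest starting position and the lazy dots simply keep expanding until the rest of the pattern fits. The restricted pattern fails at that first position and succeeds only at the second timestamp.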