
PHP regex preg_match_all greedy modifier


Here is my code:

echo "<br />";
preg_match_all("|<[^>]+>.*</[^>]+>|U",
    "<b>example:</b><strong>this is a test</strong>",
    $out, PREG_PATTERN_ORDER);
print_r($out);
echo "<br />";

echo "<br />";
preg_match_all("|<[^>]+>.*</[^>]+>|",
    "<b>example:</b><strong>this is a test</strong>",
    $out, PREG_PATTERN_ORDER);
print_r($out);
echo "<br />";

There is something I do not understand. What difference does the U at the end of the regex make?

The output is:

Array ( [0] => Array ( [0] => example: [1] => this is a test ) )

Array ( [0] => Array ( [0] => example:this is a test ) )

So what is happening here really? Which version is the greedy version and why?


Solution

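  • The U modifier (PCRE_UNGREEDY) inverts the greediness of the quantifiers in your pattern: they become "ungreedy" by default and turn greedy only when followed by ?. A greedy quantifier tries to match as much as possible, whereas an ungreedy one stops at the shortest match that still lets the rest of the pattern succeed. So your second call, the one without U, is the greedy version.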

    So in the greedy example your match is:

    <b>example:</b><strong>this is a test</strong>
    

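    The HTML tags are still part of the match; preg_match_all does not strip them. They are invisible in your output because the browser renders them as markup instead of showing them as text. Pass the output through htmlspecialchars(), or view the page source, to see them.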

    In contrast, the ungreedy version stops at the first closing tag it can reach, which gives the two separate matches you want:

    <b>example:</b>, <strong>this is a test</strong>
    

    EDIT:

    To get the same result without the U modifier, make the .* quantifier lazy by appending ?:

    preg_match_all("|<[^>/]+>.*?</[^>]+>|",
        "<b>example:</b><strong>this is a test</strong>",
        $out, PREG_PATTERN_ORDER);
    print_r($out);
    

    This works because .*? keeps the content between the tags as short as possible (ungreedy), which again results in two matches.
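
The three variants discussed above can be compared side by side. What follows is only a minimal sketch: the variable names ($html, $greedy, $ungreedy, $lazy) are chosen for illustration and do not come from the question. It assumes the output is viewed in a browser, which is why the results are escaped with htmlspecialchars() before printing.

```php
<?php
// Illustrative names only; the input string is the one from the question.
$html = "<b>example:</b><strong>this is a test</strong>";

// Greedy: .* runs to the end of the string and backtracks
// only as far as the LAST closing tag.
preg_match_all("|<[^>]+>.*</[^>]+>|", $html, $greedy);

// U modifier: every quantifier in the pattern is lazy by default.
preg_match_all("|<[^>]+>.*</[^>]+>|U", $html, $ungreedy);

// No modifier, but .*? makes this one quantifier lazy.
preg_match_all("|<[^>/]+>.*?</[^>]+>|", $html, $lazy);

// Escape the results so a browser displays the tags as text
// instead of rendering them as markup.
foreach ([$greedy, $ungreedy, $lazy] as $result) {
    echo htmlspecialchars(print_r($result[0], true)), "\n";
}
```

The greedy call should report a single match covering the whole string, while the other two should each report two matches, <b>example:</b> and <strong>this is a test</strong>, this time with the tags visible.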