
Weird PHP Regex Preg_Match Bug?


My PHP version is PHP 7.2.24-0ubuntu0.18.04.7 (cli). However, this problem seems to occur with every version I've tested.

I've encountered a very weird bug when using preg_match. Anyone know a fix?

The first section of code here works, the second one doesn't. The only difference is the input string; the regex is identical in both and is valid. For some reason the word something_happened is causing it to fail.

$one = ' (branch|leaf)';
echo "ONE:\n";
preg_match('/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/', $one, $matches, PREG_OFFSET_CAPTURE);
print_r($matches); // this works

$two = 'something_happened (branch|leaf)';
echo "\nTWO:\n";
preg_match('/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/', $two, $matches2, PREG_OFFSET_CAPTURE);
print_r($matches2); // this doesn't work

It seems somehow related to the word something_happened. If I change this word it works.

The regex matches 2 or more type names separated by |, optionally surrounded by (). Each type name may be preceded by any number of [] (or [some number] or [!some number]) and *.

Try it and see for yourself! Please let me know if you know how to fix it!


Solution

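  • The problem lies in the (?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+ group. Everything in it before [A-Za-z_]\w* is optional, so the + quantifier can split a single identifier such as something_happened into repeated iterations in a huge number of ways (something_happened, then something_happene + d, then something_happen + e + d, and so on). When the subsequent (?: ?\| ?...)+ part fails to match after that word, the regex engine backtracks through all of those splits before giving up at that position. This is catastrophic backtracking: with a long enough word the engine most likely runs into its backtracking limits, and preg_match fails instead of moving on to find the (branch|leaf) match.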
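One way to check that the call is failing rather than simply not matching is to look at the return value and at preg_last_error(). The following is a minimal sketch, not part of the original answer; the helper name pcreErrorName is hypothetical, and which error code shows up (if any) depends on the PHP version and on settings such as pcre.backtrack_limit and pcre.jit.

```php
<?php
// Hypothetical helper: maps a preg_last_error() code to its constant name.
function pcreErrorName(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];
    return $names[$code] ?? 'unknown (' . $code . ')';
}

// The original pattern and the failing input from the question.
$pattern = '/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/';
$subject = 'something_happened (branch|leaf)';

$result = preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);

if ($result === false) {
    // preg_match returns false on an engine error, 0 on "no match".
    echo 'preg_match failed: ', pcreErrorName(preg_last_error()), "\n";
} elseif ($result === 0) {
    echo "no match\n";
} else {
    echo 'matched: ', $matches[0][0], "\n";
}
```

A result of false together with a limit-related error code would point to backtracking rather than to a mistake in the pattern itself.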

    In PHP, you can work around the problem by using either of the following:

    1. Possessive quantifier:

    '/(?:\(\ ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)++(?:\ ?\|\ ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?:\ ?\))?/'

    Note the ++ at the end of the group mentioned.

    2. Atomic group:

    '/(?:\(\ ?)?((?>(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?:\ ?\|\ ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?:\ ?\))?/'

    See this regex demo. Note the (?>...) syntax.
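To try both workarounds against the two inputs from the question, something like the following sketch can be used. The variable names and the printed layout are illustrative only; the patterns are copied from the two options above.

```php
<?php
// Option 1: possessive quantifier (++) on the first type-name group.
$possessive = '/(?:\(\ ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)++(?:\ ?\|\ ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?:\ ?\))?/';
// Option 2: atomic group (?>...) around the first type name.
$atomic = '/(?:\(\ ?)?((?>(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?:\ ?\|\ ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?:\ ?\))?/';

// The two inputs from the question.
$subjects = [' (branch|leaf)', 'something_happened (branch|leaf)'];

$results = [];
foreach (['possessive' => $possessive, 'atomic' => $atomic] as $label => $pattern) {
    foreach ($subjects as $subject) {
        $ok = preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
        // With PREG_OFFSET_CAPTURE, $m[1] is [matched text, byte offset].
        $results[$label][$subject] = ($ok === 1) ? $m[1] : null;
        printf(
            "%-10s | %-32s | %s\n",
            $label,
            $subject,
            ($ok === 1) ? "{$m[1][0]} @ {$m[1][1]}" : 'no match or error'
        );
    }
}
```

With either variant, group 1 should come back as branch|leaf for both inputs, at offset 2 for the first string and offset 20 for the second.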

    Also, note that the spaces in these patterns are escaped (\ ). That keeps them working under the x (extended) flag, which lets you break the regex across several lines and indent it so that an issue like this one is easier to track down. With x, all literal whitespace and # characters outside character classes must be escaped, but that is a minor inconvenience when debugging long patterns like this.
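As an illustration of that last point, here is one possible way to lay out the possessive variant with the x flag. The line breaks, indentation, and comments are my own formatting and are not taken from the linked demo; the pattern itself is meant to be equivalent to option 1.

```php
<?php
// Same pattern as the possessive variant, spread over several lines with /x.
// Unescaped whitespace is ignored and # starts a comment, so every literal
// space is written as "\ ". The comments avoid the pattern delimiter and
// single quotes, which would end the pattern or the PHP string early.
$pattern = '/
    (?: \( \ ? )?                          # optional opening parenthesis
    (                                      # group 1: the whole type list
        (?:                                # first type name
            (?: \** \[ (?: !? \d+ )? \] )* #   [] or [n] or [!n] prefixes, each may follow stars
            \** [A-Za-z_] \w*              #   optional stars, then the identifier
        )++                                # possessive: never re-split the name
        (?:                                # one or more further alternatives
            \ ? \| \ ?                     #   separator with optional spaces
            (?: \** \[ (?: !? \d+ )? \] )*
            \** [A-Za-z_] \w*
        )+
    )
    (?: \ ? \) )?                          # optional closing parenthesis
/x';

$subject = 'something_happened (branch|leaf)';
$ok = preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE);

if ($ok === 1) {
    echo 'group 1: ', $m[1][0], ' at offset ', $m[1][1], "\n";
} else {
    echo "no match or error\n";
}
```

Keeping each quantifier directly next to the item it applies to, as above, avoids relying on how a given PCRE version treats whitespace before a quantifier in extended mode.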