My PHP version is PHP 7.2.24-0ubuntu0.18.04.7 (cli). However, it looks like this problem occurs with all versions I've tested.
I've encountered a very weird bug when using preg_match. Anyone know a fix?
The first section of code here works, the second one doesn't. But the regex itself is valid. For some reason the word something_happened is causing it to fail.
$one = ' (branch|leaf)';
echo "ONE:\n";
preg_match('/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/', $one, $matches, PREG_OFFSET_CAPTURE);
print_r($matches); // this works
$two = 'something_happened (branch|leaf)';
echo "\nTWO:\n";
preg_match('/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/', $two, $matches2, PREG_OFFSET_CAPTURE);
print_r($matches2); // this doesn't work
It seems somehow related to the word something_happened. If I change this word it works.
The regex is matching 2 or more type names separated by | that may or may not be surrounded by (), and each type name may or may not be preceded by any number of [] (or [some number] or [!some number]) and *.
Try it and see for yourself! Please let me know if you know how to fix it!
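The problem lies in the (?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+ group: the + quantifier is applied to a group whose leading subpatterns are all optional, so the group effectively behaves like (?:[A-Za-z_]\w*)+. When the rest of the pattern fails to match after something_happened, the engine backtracks and tries every way of splitting that word between iterations of the group. The number of such splits grows exponentially with the length of the word (catastrophic backtracking), so a long word creates too many options to try before the subsequent patterns are reached.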
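To confirm that the failure is a backtracking problem rather than a syntax error, one option is to check the return value of preg_match and then call preg_last_error(). This is a minimal sketch, assuming default pcre.* ini settings. The $errorNames lookup table is a hypothetical helper, and which error code is reported may depend on whether PCRE JIT is enabled:

```php
<?php
// Original pattern and input from the question, unchanged.
$pattern = '/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/';
$subject = 'something_happened (branch|leaf)';

// Hypothetical lookup table: PCRE error codes => readable names.
$errorNames = [
    PREG_NO_ERROR              => 'PREG_NO_ERROR',
    PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
    PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
    PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
    PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
];

$result = preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
$code   = preg_last_error();

if ($result === false) {
    // false means the engine gave up; 0 would mean it ran to completion without a match.
    echo 'preg_match failed: ', $errorNames[$code] ?? $code, "\n";
} else {
    echo "preg_match returned $result\n";
}
```

An empty $matches array together with a false return value is the usual sign that a limit was hit, as opposed to the pattern simply not matching.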
In PHP, you can work around the problem by using either of the following:
1. Possessive quantifier:
'/(?:\(\ ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)++(?:\ ?\|\ ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?:\ ?\))?/'
Note the ++ at the end of the group mentioned.
2. Atomic group:
'/(?:\(\ ?)?((?>(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?:\ ?\|\ ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?:\ ?\))?/'
See this regex demo. Note the (?>...) syntax.
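Both variants can be checked against the two inputs from the question with a short script. This is only a sketch: the $patterns, $subjects and $results names are made up for illustration, and the patterns are copied from the two options above:

```php
<?php
// The two workarounds from the answer, keyed by a descriptive (hypothetical) label.
$patterns = [
    'possessive' => '/(?:\(\ ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)++(?:\ ?\|\ ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?:\ ?\))?/',
    'atomic'     => '/(?:\(\ ?)?((?>(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?:\ ?\|\ ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?:\ ?\))?/',
];

// The two inputs from the question.
$subjects = [' (branch|leaf)', 'something_happened (branch|leaf)'];

$results = [];
foreach ($patterns as $name => $pattern) {
    foreach ($subjects as $subject) {
        $ok = preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
        $results[$name][$subject] = ($ok === 1) ? $m : null;

        // Print the captured type list and its offset, or report the failure.
        if ($ok === 1) {
            echo "$name | '$subject' => '{$m[1][0]}' at offset {$m[1][1]}\n";
        } else {
            echo "$name | '$subject' => no match or error\n";
        }
    }
}
```

With either variant, a word that is not followed by | is consumed once and never re-split, so the engine moves on to the next start position instead of retrying every split of something_happened.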
Also, consider the x (extended) flag: it lets you break the regex across several lines and format it, which makes it much easier to track down an issue like this. With that flag, all literal whitespace and # chars must be escaped (which is why the spaces in the patterns above are written as a backslash followed by a space), but that is a minor inconvenience when it comes to debugging long patterns like this.
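As an illustration, here is one way the possessive variant might be laid out with the x flag. The layout, the comments, and the $pattern variable name are my own choices, not the only way to format it. Comments inside the pattern avoid the / delimiter and the ' quote, because either would end the pattern or the PHP string early:

```php
<?php
// Same regex as the possessive variant, spread over several lines with the x flag.
$pattern = '/
    (?:\(\ ?)?                              # optional opening parenthesis and space
    (                                       # group 1: the whole list of type names
        (?:                                 # first type name
            (?:\**\[(?:!?\d+)?\])*          # any number of [], [N] or [!N], with optional stars
            \**[A-Za-z_]\w*                 # optional stars, then the identifier
        )++                                 # possessive: a matched name is never re-split
        (?:                                 # one or more further alternatives
            \ ?\|\ ?                        # the | separator with optional spaces
            (?:\**\[(?:!?\d+)?\])*
            \**[A-Za-z_]\w*
        )+
    )
    (?:\ ?\))?                              # optional space and closing parenthesis
/x';

$subject = 'something_happened (branch|leaf)';
preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
print_r($m);
```

Laid out this way, the two nearly identical type-name groups sit side by side, which makes the quantifier on the first one easy to spot.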