While investigating the semantic difference between quantifiers based on length and count, I noticed that Python 3 gives different results for the following two regular expressions (notice the quantifiers + and *):
Python 3.10.16 (main, Dec 7 2024, 13:31:33) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub('(.{4,5})+', '-', '1234123412341234')
'-4'
>>> re.sub('(.{4,5})*', '-', '1234123412341234')
'--4-'
>>>
I was able to reproduce this in PHP, presumably because they both use PCRE under the hood:
$ php -a
Interactive shell
php > echo preg_replace('/(.{4,5})+/', '-', '1234123412341234');
-4
php > echo preg_replace('/(.{4,5})*/', '-', '1234123412341234');
--4-
php >
How come?
Python doesn't use PCRE (the re module has its own engine), so that's not it. The two engines simply agree on this case.
(.{4,5})+ first matches "12341", then extends to include "23412", then extends to include "34123". It can't extend any further, because only one character remains and the group needs at least four, so the first 15 characters are replaced by "-". It can't match the trailing "4", so that's left alone.
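To see this concretely, here is a small sketch that inspects the match instead of replacing it. The variable names (s, m, plus_spans) are just for illustration:

```python
import re

s = '1234123412341234'

# The whole pattern matches once, from the start of the string.
m = re.match(r'(.{4,5})+', s)
print(m.span())      # (0, 15)
# A repeated group only remembers its last iteration.
print(m.group(1))    # 34123
# What is left over after the match: too short for .{4,5}
print(s[m.end():])   # 4

# Scanning the whole string finds no further matches.
plus_spans = [x.span() for x in re.finditer(r'(.{4,5})+', s)]
print(plus_spans)    # [(0, 15)]
```

So re.sub has exactly one span to replace, which gives '-4'.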
(.{4,5})* matches the first 15 characters the same way. But then it's left with a trailing "4". The group can't match there, but the empty string does match at that position, because * means 0 or more repetitions, and 0 counts. So you get a second "-". The trailing "4" itself doesn't match, so it's left alone. Then the empty string after the trailing "4" matches (0 repetitions again), so you get a final "-".
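You can list the spans that get replaced with re.finditer. This is a minimal sketch, assuming Python 3.7 or later (the session above is 3.10); star_spans is just an illustrative name:

```python
import re

s = '1234123412341234'

# Every match of the * version, including the empty ones.
star_spans = [m.span() for m in re.finditer(r'(.{4,5})*', s)]
print(star_spans)  # [(0, 15), (15, 15), (16, 16)]

# Three matches, so three "-" in the output; the "4" at index 15
# belongs to no match and is copied through unchanged.
print(re.sub(r'(.{4,5})*', '-', s))  # --4-
```

The two zero-width spans, (15, 15) and (16, 16), are the empty matches before and after the trailing "4".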