python php regex pcre

Why do Python regular expressions give different results for the following two quantifiers?


While investigating the semantic difference between quantifiers based on length and count, I noticed that Python 3 gives different results for the following two regular expressions (notice the quantifiers + and *):

Python 3.10.16 (main, Dec  7 2024, 13:31:33) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub('(.{4,5})+', '-', '1234123412341234')
'-4'
>>> re.sub('(.{4,5})*', '-', '1234123412341234')
'--4-'
>>>

I was able to reproduce this in PHP, presumably because both use PCRE under the hood:

$ php -a
Interactive shell

php > echo preg_replace('/(.{4,5})+/', '-', '1234123412341234');
-4
php > echo preg_replace('/(.{4,5})*/', '-', '1234123412341234');
--4-
php >

How come?


Solution

  • Python doesn't use PCRE (its re module has its own engine), so that's not the reason. The two engines simply agree on this behavior.

    (.{4,5})+

    first matches "12341", then extends to include "23412", then extends to include "34123". It can't go any further, so the first 15 characters are replaced by "-". The trailing "4" is too short to match, so it's left alone.

    (.{4,5})*

    matches the first 15 characters the same way, leaving a trailing "4". But now the empty string just before that "4" also matches, because * means 0 or more repetitions, and 0 counts. So you get a second "-". The trailing "4" itself doesn't match, so it's left alone. Then the empty string after the trailing "4" matches (0 repetitions again), so you get a final "-".
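
To see where each "-" comes from, it can help to list the spans that the pattern matches instead of replacing them. The following is a minimal sketch; the helper name match_spans is made up for illustration, and the output shown in the comments assumes Python 3.7 or later (such as the 3.10 session in the question), where an empty match directly after a non-empty match is reported.

```python
import re

TEXT = "1234123412341234"


def match_spans(pattern, text):
    """Return (start, end, matched_text) for every match found by re.finditer."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, text)]


for pattern in (r"(.{4,5})+", r"(.{4,5})*"):
    print(pattern)
    for start, end, matched in match_spans(pattern, TEXT):
        print(f"  [{start}:{end}] {matched!r}")

# Output on Python 3.7+:
# (.{4,5})+
#   [0:15] '123412341234123'
# (.{4,5})*
#   [0:15] '123412341234123'
#   [15:15] ''
#   [16:16] ''
```

With +, there is a single match covering the first 15 characters. With *, two additional empty matches appear, at offset 15 (just before the trailing "4") and at offset 16 (the end of the string), and each of those becomes an extra "-" in the re.sub result.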