Search code examples
regexpcre

Recurse subpattern doesn't seem to work with alternation


I want to match strings with numbers separated by commas. In a nutshell I want to match at most 8 numbers in range 1-16. So that string 1,2,3,4,5,6,7,8 is OK and 1,2,3,4,5,6,7,8,9 is not since it has 9 numbers. Also 16 is OK but 17 is not since 17 is not in range.

I tried to use this regex ^(?:(?:[1-9]|1[0-6]),){0,7}(?:[1-9]|1[0-6])$ and it worked fine. I use alternation to match numbers from 1-16, then I use 0..7 repetition with comma at the end and then the same without comma at the end. But I do not like the repetition of the subpattern, so I tried (?1) to recurse the first capturing group. My regex looks like ^(?:([1-9]|1[0-6]),){0,7}(?1)$. However this do not produce match when the last number has two letters (10-16). It does match 1,1, but not 1,10. I do not understand why.

I created an example of the problem.

https://regex101.com/r/VkuPqP/1

In the debugger I see that the engine do not try the second alternation from the group, when the pattern recurse. I expect it to work. Where is the problem?


Solution

  • That happens because the regex subroutines in PCRE are atomic.

    The regex you have can be re-written as ^(?:([1-9]|1[0-6]),){0,7}(?>[1-9]|1[0-6])$, see its demo. (?>...|...) will not allow backtracking into this group, so if the first branch "wins" (as in your example), the subsequent ones will not be tried upon the next subpattern failure (here, $ fails to match the end of string after matching 1 - it is followed with 0).

    You may swap the alternatives in this case, the longer should come first:

    ^(?:(1[0-6]|[1-9]),){0,7}(?1)$
    

    See the regex demo.

    In general, the best practice is that each alternative in a group must match in different location inside a string. They should not match at the same locations.

    If you can't rewrite an alternation group so that each alternative matches at unique locations in the string, you should repeat the group without using a regex subroutine.