php regex regex-group

Regex: unable to detect all the occurrences of a particular group when there is more than one


I'm trying to run a regular expression (in PHP) to detect and extract conditions with their matches, but I'm stuck on the elseif scenario (which can be recurring).

Here is my current regular expression:

/\{\%if (.+)\%\}(.*)(?:\{\%elseif (.+)\%\}(.*)(?=\{\%))*?(?:\{\%else\%\}(.*))*\{\%endif\%\}/gsU

Here is the test I want to pass; it currently fails on the last group of conditions (you can also see it on regex101):

{%if $foo === "bar" && getThisCondition()%}
    My if result
{%endif%}

{%if $foo === "bar" && getThisCondition()%}
    My if result
{%else%}
    My else result
{%endif%}

{%if $foo === "bar" && getThisCondition()%}
    My if result
{%elseif $foo === "bar" && getThisCondition()%}
    My elseif result
{%endif%}

{%if $foo === "bar" && getThisCondition()%}
    My if result
{%elseif $foo === "baz" && !getThisCondition()%}
    My elseif result
{%else%}
    My else result
{%endif%}

{%if $foo === "bar" && getThisCondition()%}
    My if result
{%elseif $foo === "baz" && !getThisCondition()%}
    My elseif result
{%elseif $foo === "baf" && !getThisCondition()%}
    My elseif result
{%elseif $foo === "bak" && !getThisCondition()%}
    My elseif result
{%else%}
  My else result
{%endif%}

How can I make sure that all elseif occurrences are taken into account? When I isolate that group (and remove the *?), it works:

(?:\{\%elseif (.+)\%\}(.*)(?=\{\%))*?/gsU

... but if I put it back into the whole expression, it doesn't work anymore.

What am I missing?


Solution

  • As noted in the comments, PHP cannot return every capture of a repeated capturing group: each repetition overwrites the group's content, so only the last iteration survives.
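
    To make the overwriting concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration and do not come from the question; the patterns are deliberately much simpler than the real one.

    ```php
    <?php
    // Hypothetical sample: two consecutive elseif blocks.
    $subject = '{%elseif a%}X{%elseif b%}Y';

    // One match with a repeated group: every iteration overwrites groups 1 and 2,
    // so $single[1] and $single[2] only hold the last iteration.
    preg_match('~(?:\{%elseif (\w+)%\}(\w+))+~', $subject, $single);

    // The same group without the quantifier, applied repeatedly by preg_match_all:
    // $all gets one entry per elseif block.
    preg_match_all('~\{%elseif (\w+)%\}(\w+)~', $subject, $all, PREG_SET_ORDER);
    ```

    Here $single[1] is "b" and $single[2] is "Y", while $all contains two entries, one for each block.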

    That everything cannot be caught in one match doesn't mean it cannot be done with one pattern. You can use preg_match_all (or preg_replace_callback) to retrieve each part of a conditional statement with a pattern like the one below, which first checks that the full statement is well-formed and then grabs its parts one by one:

    ~
    (?(DEFINE)
        (?<full> {%if\     \g<cond> %}  \g<cont>
             (?: {%elseif\ \g<cond> %}  \g<cont> )*
             (?: {%else%}               \g<cont> )?
                 {%endif%}
        )
        (?<cond> [^%]*+ (?: % (?!}) [^%]* )*+ )
        (?<cont> [^{]*+ (?: { (?!%) [^{]* )*+ )
    )
    
    (?J) # allow duplicate named captures
    (?=\g<full>) # check if a well formed if/elseif/else/endif is at this position
    {%if\ (?<condition> \g<cond> ) %} (?<content> \g<cont> )
    |
    \G (?<= {%endif%} ) (*SKIP)(*F) # break the contiguity after {%endif%}
    |
    \G {%elseif\ (?<condition> \g<cond> ) %} (?<content> \g<cont> )
    |
    \G {%else%} (?<content> \g<cont> )
    |
    \G {%endif%}
    ~xu
    

    Once the branch that starts with (?=\g<full>) has matched, every other part of the statement is matched from a contiguous position thanks to the \G anchor, which matches at the position where the previous match ended. The second branch, \G (?<= {%endif%} ) (*SKIP)(*F), is there to break this contiguity once the end of the statement is reached.

    With this kind of pattern, all you have to do is loop over the match results and check when a new if statement begins.
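
    Such a loop could look like the sketch below. The pattern is the one above, copied unchanged; the template string, the $statements structure, and the prefix test on the full match are assumptions chosen for illustration, not something the answer prescribes.

    ```php
    <?php
    // The pattern from the answer, copied verbatim.
    $pattern = '~
    (?(DEFINE)
        (?<full> {%if\     \g<cond> %}  \g<cont>
             (?: {%elseif\ \g<cond> %}  \g<cont> )*
             (?: {%else%}               \g<cont> )?
                 {%endif%}
        )
        (?<cond> [^%]*+ (?: % (?!}) [^%]* )*+ )
        (?<cont> [^{]*+ (?: { (?!%) [^{]* )*+ )
    )

    (?J) # allow duplicate named captures
    (?=\g<full>) # check if a well formed if/elseif/else/endif is at this position
    {%if\ (?<condition> \g<cond> ) %} (?<content> \g<cont> )
    |
    \G (?<= {%endif%} ) (*SKIP)(*F) # break the contiguity after {%endif%}
    |
    \G {%elseif\ (?<condition> \g<cond> ) %} (?<content> \g<cont> )
    |
    \G {%else%} (?<content> \g<cont> )
    |
    \G {%endif%}
    ~xu';

    // Made-up template: two statements separated by plain text.
    $template = '{%if a%}A{%elseif b%}B{%else%}C{%endif%} text {%if d%}D{%endif%}';

    preg_match_all($pattern, $template, $matches, PREG_SET_ORDER);

    $statements = [];
    foreach ($matches as $m) {
        if ($m[0] === '{%endif%}') {
            continue; // closing tag: nothing to record
        }
        if (strncmp($m[0], '{%if ', 5) === 0) {
            $statements[] = []; // a new if statement begins here
        }
        // Assumes the first match is always an if part, which holds for this template.
        $condition = $m['condition'] ?? '';
        $statements[count($statements) - 1][] = [
            'condition' => $condition === '' ? null : $condition, // null for the else part
            'content'   => $m['content'] ?? '',
        ];
    }
    ```

    Each entry of $statements then lists the parts of one if statement in order, with a null condition marking the else part.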

    Note that you can factor \G out of the last branches with a non-capturing group, \G(?: branch2 | branch3 ...). Better still, replace it with \G(?!\A) to avoid an unwanted match at the start of the string, since \G also succeeds there by default.
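
    The difference between \G and \G(?!\A) can be seen on a small example. This is only a sketch: the input string is made up, and the two patterns keep just the tag alternatives rather than the full pattern.

    ```php
    <?php
    // Made-up input that starts with a stray {%else%} tag.
    $stray = '{%else%}orphan{%endif%}';

    // \G alone succeeds at offset 0, so the stray tag is matched.
    $plain = preg_match('~\G(?:\{%else%\}|\{%endif%\})~', $stray);

    // \G(?!\A) refuses the very start of the string, so nothing matches.
    $guarded = preg_match('~\G(?!\A)(?:\{%else%\}|\{%endif%\})~', $stray);
    ```

    $plain is 1 and $guarded is 0: the guarded form no longer accepts a tag that is not preceded by a matched if part.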

    To know which branch succeeded, you can add a capture group, let's say stmt, that captures the keyword: %(?<stmt>if)\ , %(?<stmt>elseif)\ , etc. A more playful alternative is the (*MARK) control verb, which can carry a label: just put it somewhere in the corresponding branch ((*MARK:if), (*MARK:elseif), ...). With preg_match_all, a MARK item holding the label is added to the result array, but this doesn't work with preg_replace_callback.
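
    As a rough illustration of the (*MARK) idea, the sketch below uses a simplified, hypothetical pattern that only recognises the opening of each tag; it is not the full pattern from above, and the template is made up.

    ```php
    <?php
    // Simplified pattern: each alternative ends with its own labeled (*MARK).
    $pattern = '~\{%(?:if\b(*MARK:if)|elseif\b(*MARK:elseif)|else\b(*MARK:else)|endif\b(*MARK:endif))~';

    $template = '{%if a%}A{%elseif b%}B{%else%}C{%endif%}';

    preg_match_all($pattern, $template, $matches, PREG_SET_ORDER);

    // With PREG_SET_ORDER every match array carries its own MARK entry.
    $labels = array_column($matches, 'MARK');
    ```

    $labels then lists the branch that produced each match, in order: if, elseif, else, endif.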