
Regular expression for template engine?


I'm learning about regular expressions and want to write a templating engine in PHP.

Consider the following "template":

<!DOCTYPE html>
<html lang="{{print("{hey}")}}" dir="{{$dir}}">
<head>
    <meta charset="{{$charset}}">
</head>
<body>
    {{$body}}
    {{}}
</body>
</html>

I managed to create a regex that will find anything except for {{}}.

Here's my regex:

{{[^}]+([^{])*}}

There's just one problem. How do I allow the literal { and } to be used within {{}} tags?

It will not find {{print("{hey}")}}.

Thanks in advance.


Solution

  • This is a pattern to match the content inside double curly brackets:

    $pattern = <<<'LOD'
    ~
    (?(DEFINE)
        (?<quoted>
            ' (?: [^'\\]+ | (?:\\.)+ )++ ' |
            " (?: [^"\\]+ | (?:\\.)+ )++ "
        )
        (?<nested>
            { (?: [^"'{}]+ | \g<quoted> | \g<nested> )*+ }
        )
    )
    
    {{
        (?<content>
            (?: 
                [^"'{}]+
              | \g<quoted>  
              | \g<nested>
    
            )*+
        )
    }}
    ~xs
    LOD;
    

    Compact version:

    $pattern = '~{{((?>[^"\'{}]+|((["\'])(?:[^"\'\\\]+|(?:\\\\.)+|(?:(?!\3)["\'])+)++\3)|({(?:[^"\'{}]+|\g<2>|(?4))*+}))*+)}}~s';
    

    The content is in the first capturing group, but you can use the named capture 'content' with the detailed version.
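
    As a rough usage sketch (the shortened $template string and the variable names $template, $matches and $contents are my own assumptions, not part of the original answer), the detailed pattern can be applied with preg_match_all and the tag contents read from the named group:

```php
<?php
// Detailed pattern from this answer, kept in a nowdoc so backslashes stay literal.
$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<quoted>
        ' (?: [^'\\]+ | (?:\\.)+ )++ ' |
        " (?: [^"\\]+ | (?:\\.)+ )++ "
    )
    (?<nested>
        { (?: [^"'{}]+ | \g<quoted> | \g<nested> )*+ }
    )
)
{{
    (?<content>
        (?: [^"'{}]+ | \g<quoted> | \g<nested> )*+
    )
}}
~xs
LOD;

// Hypothetical input: a one-line version of the template from the question.
$template = '<html lang="{{print("{hey}")}}" dir="{{$dir}}">{{$body}} {{}}</html>';

// With the default PREG_PATTERN_ORDER flag, $matches['content'] holds one entry per tag.
preg_match_all($pattern, $template, $matches);
$contents = $matches['content'];

// Four tags are found; the empty tag {{}} gives an empty string.
print_r($contents);
```

    Note that the empty tag {{}} is matched as well, with an empty content, so a caller that wants to ignore it has to filter out empty strings.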

    Although this pattern is longer, it allows anything you want inside quoted parts, including escaped quotes, and in many cases it is faster than a simple lazy quantifier. Nested curly brackets are allowed too: you can write {{ doThat(){ doThis(){ }}}} without problems.
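
    To illustrate the difference, here is a small comparison sketch. The lazy pattern ~{{(.*?)}}~s and the two sample inputs are assumptions I chose for the demonstration; the compact pattern is the one from this answer, written with doubled backslashes for the single-quoted string:

```php
<?php
// Compact pattern from this answer (backslashes doubled for a single-quoted string).
$compact = '~{{((?>[^"\'{}]+|((["\'])(?:[^"\'\\\\]+|(?:\\\\.)+|(?:(?!\3)["\'])+)++\3)|({(?:[^"\'{}]+|\g<2>|(?4))*+}))*+)}}~s';

// Naive lazy pattern, shown only for comparison.
$lazy = '~{{(.*?)}}~s';

$inputs = [
    '{{ doThat(){ doThis(){ }}}}', // nested curly brackets
    '{{ echo "}}" }}',             // closing delimiter inside a quoted part
];

$results = [];
foreach ($inputs as $input) {
    preg_match($lazy, $input, $l);
    preg_match($compact, $input, $c);
    // Group 1 is the tag content in both patterns.
    $results[] = ['lazy' => $l[1], 'compact' => $c[1]];
}

print_r($results);
```

    The lazy version stops at the first }} it meets, so it cuts both tags short, while the compact pattern returns the whole content.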

    The subpattern for quotes can also be written like this, which avoids repeating the same thing for single and double quotes (I use it in the compact version):

    (["'])             # the quote type is captured (single or double)
    (?:                # open a group (for the various alternatives)
        [^"'\\]+       # all characters that are not a quote or a backslash
      |                # OR
        (?:\\.)+       # escaped characters (needs the s modifier so the dot also matches a newline)
      |                #
        (?!\g{-1})["'] # a quote that is not the captured quote
    )++                # repeat one or more times
    \g{-1}             # the captured quote (-1 refers to the last capturing group)
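
    A quick way to try this subpattern on its own is sketched below. The sample subject and the names $quoted and $m are assumptions made for the example:

```php
<?php
// The quote subpattern from above, used as a standalone pattern.
$quoted = <<<'RX'
~
(["'])
(?: [^"'\\]+ | (?:\\.)+ | (?!\g{-1})["'] )++
\g{-1}
~xs
RX;

// Subject text is: say "it's \"ok\"" now
$subject = 'say "it\'s \"ok\"" now';

preg_match($quoted, $subject, $m);

echo $m[0], "\n"; // the whole quoted part, including the escaped quotes
echo $m[1], "\n"; // the opening quote character that was captured
```

    The single quote inside the double-quoted part is accepted by the third alternative, and the escaped double quotes by the second one.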
    

    Note: a literal backslash in the pattern must be written \\ in nowdoc syntax, but \\\ or \\\\ inside a single-quoted string.
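
    The following sketch checks this rule on the small fragment \\. (an escaped character). The variable names are assumptions for the example:

```php
<?php
// Nowdoc: what you type is exactly what the regex engine receives.
$nowdoc = <<<'RX'
\\.
RX;

// Single-quoted strings: PHP itself turns \\ into one backslash first.
$three = '\\\.';   // becomes \\.  (same as the nowdoc)
$four  = '\\\\.';  // becomes \\.  (same as the nowdoc)
$two   = '\\.';    // becomes \.   (the regex engine sees an escaped dot)

var_dump($nowdoc === $three, $nowdoc === $four, $nowdoc === $two);

// Subject is a backslash followed by a double quote.
$escapedQuote = '\\"';
$okWithFour = preg_match('~^(?:' . $four . ')+$~s', $escapedQuote);
$okWithTwo  = preg_match('~^(?:' . $two . ')+$~s', $escapedQuote);
var_dump($okWithFour, $okWithTwo);
```

    With only two backslashes in a single-quoted string, the fragment silently matches a literal dot instead of an escaped character, which is easy to miss.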

    Explanation of the detailed pattern:

    The pattern is divided into two parts:

    • the definitions, where I define named subpatterns
    • the main pattern itself

    The definition section avoids repeating the same subpattern several times in the main pattern and makes the pattern clearer. You define subpatterns in this section and call them later:
    (?(DEFINE)....)
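
    As a minimal sketch of the idea (the "key" => "value" format and the variable names are assumptions chosen for the example, not taken from the template pattern), one subpattern is defined once and then called twice:

```php
<?php
$pattern = <<<'RX'
~
(?(DEFINE)
    (?<quoted> " (?: [^"\\]+ | \\. )*+ " )
)
^ \g<quoted> \s* => \s* \g<quoted> $
~xs
RX;

// Both sides are double-quoted; the right side contains an escaped quote.
$valid = preg_match($pattern, '"key" => "va\"lue"');

// The right side is not quoted, so the second call to "quoted" fails.
$invalid = preg_match($pattern, '"key" => value');

var_dump($valid, $invalid);
```

    Nothing inside (?(DEFINE)...) consumes input by itself; only the \g<quoted> calls in the main part do.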

    This section contains 2 named subpatterns:

    • quoted : that contains the description of quoted parts
    • nested : that describes nested curly brackets parts

    Detail of nested:

    (?<nested>           # open the named group "nested"
        {                # literal {
     ## what can be found between curly brackets? ##
        (?>              # open an atomic* group (the full pattern above writes (?: here;
                         # with the possessive *+ the result is the same)
            [^"'{}]+     # all characters one or more times, except "'{}
          |              # OR
            \g<quoted>   # quoted content, so that curly brackets inside quoted parts are skipped
                         # (I call the subpattern defined before instead of rewriting it)
          | \g<nested>   # OR curly parts. This is a recursion
        )*+              # repeat the atomic group zero or more times (possessive *)
        }                # literal }
    )                    # close the named group
    

    (* more information about atomic groups and possessive quantifiers)

    All of this is only definitions, though. The pattern really begins with {{. Then I open a named capture group (content) and describe what can be found inside it (nothing new here).

    I use two modifiers, x and s. x activates verbose mode, which lets you put whitespace freely in the pattern (useful for indentation). s is single-line mode, in which the dot can match newlines (it can't by default). I use this mode because there is a dot in the quoted subpattern.
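
    The effect of both modifiers can be checked with a few tiny patterns. The subjects and variable names below are assumptions made for the demonstration:

```php
<?php
// A backslash followed by a real newline, like a line continuation inside a quoted part.
$subject = "\\\n";

// Pattern text is \\. (a backslash, then any character).
$withoutS = preg_match('~\\\\.~', $subject);  // the dot refuses the newline
$withS    = preg_match('~\\\\.~s', $subject); // the dot accepts the newline

// In x mode, unescaped whitespace in the pattern is ignored; \s still matches a space.
$withX = preg_match('~ a \s b ~x', 'a b');

var_dump($withoutS, $withS, $withX);
```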