php regex latex non-greedy

PHP PCRE regular expression


In LaTeX, the expression \o{a}{b} means the operator 'o' takes two arguments a and b. LaTeX also accepts \o{a}, and in this case treats the second argument as the empty string.

Now I try to match the regex \\\\o\{([\s\S]*?)\}\{([\s\S]*?)\} against the string \o{a}\o{a}{b}. It treats the whole string as a match when it isn't. (The correct interpretation of this string is that the substring \o{a}{b} is the only match.) I need to know how to tell PHP that if anything other than { follows the first }, then it is not a match.

How should I do that?

Edit: Arguments of an operator are allowed to contain the symbols \, { and }. In this case, though, the whole string is not a match because the curly brackets in a}\o{a do not conform to LaTeX rules (e.g. { must come before }), so a}\o{a cannot be an argument of an operator.

Edit2: On the other hand, \o{{a}}{b} should be a match as {a} is a valid argument.


Solution

  • I suggest something like this:

    $s = '\\o{a}\\o{a}{b}';
    echo "$s\n";  # Check string
    preg_match('~\\\o(\{(?>[^{}\\\]++|(?1)|\\\.)+\}){2}~', $s, $match);
    print_r($match);
    


    The regex:

    • uses recursion to deal with nested braces,
    • handles backslashes too ([^{}\\\] and \\\.), so that escaped literal braces are not mistaken for syntactic braces.

    \\\o             # Matches \o
    (                # Group 1, the target of the recursion below
      \{             # Matches {
      (?>            # Begin atomic group (no backtracking into it, which makes the regex faster)
         [^{}\\\]++  # Any characters except braces and backslash
      |
         (?1)        # Or recurse the outer group
      |
         \\\.        # Or match an escaped character
      )+             # As many times as necessary
      \}             # Closing brace
    ){2}             # Repeat twice
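
    The commented breakdown above can be run more or less as written by using the x (extended) modifier. The sketch below is one way to do that, not the only one: it puts the pattern in a nowdoc, so no PHP-level escaping applies and the backslashes appear in their plain regex form (\\o instead of \\\o). The names $pattern, $cases and $results are illustrative, and the third and fourth inputs are made-up test strings that do not come from the question.

```php
<?php
// Same pattern as above, written with the x modifier so that whitespace
// and # comments inside the regex are ignored by PCRE.
$pattern = <<<'REGEX'
~
\\o                # literal backslash followed by o
(                  # group 1: one brace-delimited argument
  \{
  (?>
     [^{}\\]++     # run of characters other than braces and backslash
  |
     (?1)          # nested argument: recurse into group 1
  |
     \\.           # backslash followed by any one character
  )+
  \}
){2}               # exactly two arguments
~x
REGEX;

// Inputs as single-quoted PHP strings, where '\\' is one backslash.
$cases = [
    '\\o{a}\\o{a}{b}',   // from the question: only \o{a}{b} should match
    '\\o{{a}}{b}',       // from the question (Edit2): nested braces
    '\\o{a\\}b}{c}',     // assumed extra case: escaped brace in an argument
    '\\o{a}',            // assumed extra case: a single argument, no match
];

$results = [];
foreach ($cases as $s) {
    // Store the full match, or null when the pattern does not match.
    $results[$s] = preg_match($pattern, $s, $m) === 1 ? $m[0] : null;
    echo $s, ' => ', $results[$s] ?? '(no match)', "\n";
}
```

    For the first input this reports \o{a}{b} as the match. The second and third inputs match in full, and the last one does not match.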
    

    The problem with your current regex is that once the part \\\\o\{([\s\S]*?) has matched, the engine just looks for the next place where \}\{ can match, and it does not matter whether you use a lazy quantifier or a greedy one: [\s\S] matches braces too, so the first argument is free to run past a } and into the next operator. You need to prevent it from matching a } before the actual \} comes in the regex.

    That's why you have to use [^{}], and since arguments can contain nested braces, this is an ideal case for recursion.
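
    The lazy-versus-greedy point can be checked with a small sketch. It runs the pattern from the question next to a greedy variant; the variable names are illustrative, and the greedy pattern is only there for comparison and is not part of the question.

```php
<?php
$s = '\\o{a}\\o{a}{b}';   // the string \o{a}\o{a}{b}

// The question's pattern (lazy) and a greedy variant for comparison.
$lazy   = '~\\\\o\{([\s\S]*?)\}\{([\s\S]*?)\}~';
$greedy = '~\\\\o\{([\s\S]*)\}\{([\s\S]*)\}~';

preg_match($lazy, $s, $lazyMatch);
preg_match($greedy, $s, $greedyMatch);

// Both report the whole string as the match, with a}\o{a captured
// as the first "argument" and b as the second.
print_r($lazyMatch);
print_r($greedyMatch);
```

    In both cases the first capture group swallows a}\o{a, because the only place in the string where }{ occurs is near the end.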