php, regex, parentheses

Regex to match expression with multiple parentheses, one within each other


I'm building a task (in PHP) that reads all the files in my project in search of i18n messages. I want to detect messages like these:

// Basic example
__('Show in English')  => Show in English
// Get the message and the name of the i18n file 
__("Show in English", array(), 'page') => Show in English, page
// Be careful of quotes
__("View Mary's Car", array()) => View Mary's Car
// Be careful of strings after the __() expression
__('at').' '.function($param) => at

The regular expression that works for those cases (and for some others not shown here) is:

__\(.*?['|\"](.*?)(?:['|\"][\.|,|\)])(?: *?array\(.*?\),.*?['|\"](.*?)['|\"]\)[^\)])?

However, it doesn't work if the expression spans multiple lines. For that I have to add the dotall modifier /s, but that breaks the previous regular expression, because it no longer knows where to stop looking ahead:

// Detect with multiple lines
echo __('title_in_place', array(
    '%title%' => $place['title']
  ), 'welcome-user'); ?>    

One thing would solve the problem and simplify the regular expression: matching open and close parentheses in pairs. Then, no matter what's inside __() or how many parentheses there are, it "counts" the number of openings and expects the same number of closings.

Is it possible? How? Thanks a lot!


Solution

  • Yes. First, here is the classic example for simple nested brackets (parentheses):

    \(([^()]|(?R))*\)

    or a faster version which uses a possessive quantifier:

    \(([^()]++|(?R))*\)

    or (equivalent) atomic grouping:

    \((?>[^()]+|(?R))*\)

    But you can't use (?R), which recurses into the whole pattern, here, because the outermost parentheses are special: they are preceded by two underscores. Here is a tested script which matches what I think you want.

    Solution: recurse into capture group $1 only, using the subroutine call (?1):

    <?php // test.php Rev:20120625_2200
    $re_message = '/
        # match __(...(...)...) message lines (having arbitrary nesting depth).
        __\(                     # Outermost opening bracket (preceded by __).
        (                        # Group $1: Bracket contents (subroutine).
          (?:                    # Group of bracket contents alternatives.
            [^()"\']++           # Either one or more non-brackets, non-quotes,
          | "[^"\\\\]*(?:\\\\[\S\s][^"\\\\]*)*"      # or a double quoted string,
          | \'[^\'\\\\]*(?:\\\\[\S\s][^\'\\\\]*)*\'  # or a single quoted string,
          | \( (?1) \)          # or a nested bracket (repeat group 1 here!).
          )*                    # Zero or more bracket contents alternatives.
        )                       # End $1: recursed subroutine.
        \)                      # Outermost closing bracket.
        .*                      # Match remainder of line following __()
        /mx';
    $data = file_get_contents('testdata.txt');
    $count = preg_match_all($re_message, $data, $matches);
    printf("There were %d __(...) messages found.\n", $count);
    for ($i = 0; $i < $count; ++$i) {
        printf("  message[%d]: %s\n", $i + 1, $matches[0][$i]);
    }
    ?>
    

    Note that this solution handles balanced parentheses inside the "__(...)" construct to arbitrary depth (limited in practice by available memory and PCRE's recursion limits). It also correctly handles quoted strings inside "__(...)" and ignores any parentheses that appear inside those quoted strings. Good luck.
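
The script above matches each whole __(...) call, but the question also asks for the message and the i18n file name. Here is a minimal sketch of one way to pull those out of group 1. It is an illustration, not part of the tested script: the helper name extract_i18n_messages() is hypothetical, and it assumes the message is the first argument and a plain quoted literal, and that the file name, when given, is a quoted literal in the last argument position. Escape sequences inside the literals are returned as written, not decoded.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer).
// Returns one ['message' => ..., 'domain' => ...] entry per __(...) call
// whose first argument is a quoted string literal.
function extract_i18n_messages(string $source): array
{
    // Same recursive idea as above: group 1 captures everything between
    // the outermost parentheses of __( ... ), at any nesting depth.
    $reCall = '/
        __\(
        (
          (?:
            [^()"\']++                                 # plain text
          | "[^"\\\\]*(?:\\\\[\S\s][^"\\\\]*)*"        # double quoted string
          | \'[^\'\\\\]*(?:\\\\[\S\s][^\'\\\\]*)*\'    # single quoted string
          | \( (?1) \)                                 # nested parentheses
          )*
        )
        \)
        /x';

    // First argument: a quoted literal at the start of the argument list.
    $reFirst = '/^\s*(?|"((?:[^"\\\\]|\\\\.)*)"|\'((?:[^\'\\\\]|\\\\.)*)\')/s';
    // Last argument: a quoted literal after a comma, at the very end.
    $reLast = '/,\s*(?|"((?:[^"\\\\]|\\\\.)*)"|\'((?:[^\'\\\\]|\\\\.)*)\')\s*$/s';

    $found = [];
    preg_match_all($reCall, $source, $calls);
    foreach ($calls[1] as $args) {
        if (!preg_match($reFirst, $args, $first)) {
            continue; // first argument is not a string literal; skip the call
        }
        // Assumption: a trailing quoted literal is the i18n file name.
        $domain = preg_match($reLast, $args, $last) ? $last[1] : null;
        $found[] = ['message' => $first[1], 'domain' => $domain];
    }
    return $found;
}

// The examples from the question, including the multi-line one.
$source = <<<'SRC'
__('Show in English')
__("Show in English", array(), 'page')
__("View Mary's Car", array())
__('at').' '.function($param)
echo __('title_in_place', array(
    '%title%' => $place['title']
  ), 'welcome-user'); ?>
SRC;

foreach (extract_i18n_messages($source) as $hit) {
    $suffix = $hit['domain'] !== null ? ', ' . $hit['domain'] : '';
    printf("%s%s\n", $hit['message'], $suffix);
}
```

On the question's examples this should print "Show in English", "Show in English, page", "View Mary's Car", "at" and "title_in_place, welcome-user", one per line. No /s modifier is needed on the outer pattern, because the negated character class already matches newlines. Note that the last-argument check is only a heuristic: a call such as __('a', 'b') would report 'b' as the file name, so a real task may want to split the arguments at top-level commas instead.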