I'm building a task (in PHP) that reads all the files of my project in search of i18n messages. I want to detect messages like these:
// Basic example
__('Show in English') => Show in English
// Get the message and the name of the i18n file
__("Show in English", array(), 'page') => Show in English, page
// Be careful of quotes
__("View Mary's Car", array()) => View Mary's Car
// Be careful of strings after the __() expression
__('at').' '.function($param) => at
The regex expression that works for those cases (there are some other cases taken into account) is:
__\(.*?['|\"](.*?)(?:['|\"][\.|,|\)])(?: *?array\(.*?\),.*?['|\"](.*?)['|\"]\)[^\)])?
However, if the expression spans multiple lines it doesn't work. I have to include the dotall modifier /s, but that breaks the previous regex, as it no longer controls well when to stop looking ahead:
// Detect with multiple lines
echo __('title_in_place', array(
'%title%' => $place['title']
), 'welcome-user'); ?>
One thing would solve the problem and simplify the regex: matching open/close parentheses. That way, no matter what's inside __() or how many parentheses there are, it "counts" the number of openings and expects the same number of closings.
Is it possible? How? Thanks a lot!
Yes. First, here is the classic example for simple nested brackets (parentheses):
\(([^()]|(?R))*\)
or faster versions which use a possessive quantifier:
\(([^()]++|(?R))*\)
or (equivalent) atomic grouping:
\((?>[^()]+|(?R))*\)
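As a quick illustration of the patterns above, here is a minimal sketch that runs the atomic-group version against a made-up string. The sample text and the variable names are assumptions chosen only for the demonstration.

```php
<?php
// Atomic-group version of the classic recursive pattern.
// (?R) recurses into the whole pattern, so each match is one balanced (...) group.
$re = '/\((?>[^()]+|(?R))*\)/';

// Hypothetical sample text: two balanced groups, the first one nested.
$text = 'f(a, g(b, (c))) + h(d)';

preg_match_all($re, $text, $m);

// Print the outermost balanced groups that were found.
foreach ($m[0] as $group) {
    echo $group, "\n";
}
```

Each match starts at an opening bracket and only ends once every nested bracket inside it has been closed, so the inner groups are not reported separately.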
But you can't use the (?R) "match whole expression" construct here because the outermost brackets are special (with two leading underscores). Here is a tested script which matches (what I think) you want. It uses a group $1 (recursive) subroutine call, (?1), instead:
<?php // test.php Rev:20120625_2200
$re_message = '/
# match __(...(...)...) message lines (having arbitrary nesting depth).
__\( # Outermost opening bracket (with leading __().
( # Group $1: Bracket contents (subroutine).
(?: # Group of bracket contents alternatives.
[^()"\']++ # Either one or more non-brackets, non-quotes,
| "[^"\\\\]*(?:\\\\[\S\s][^"\\\\]*)*" # or a double quoted string,
| \'[^\'\\\\]*(?:\\\\[\S\s][^\'\\\\]*)*\' # or a single quoted string,
| \( (?1) \) # or a nested bracket (repeat group 1 here!).
)* # Zero or more bracket contents alternatives.
) # End $1: recursed subroutine.
\) # Outermost closing bracket.
.* # Match remainder of line following __()
/mx';
$data = file_get_contents('testdata.txt');
$count = preg_match_all($re_message, $data, $matches);
printf("There were %d __(...) messages found.\n", $count);
for ($i = 0; $i < $count; ++$i) {
printf(" message[%d]: %s\n", $i + 1, $matches[0][$i]);
}
?>
Note that this solution handles balanced parentheses (inside the "__(...)" construct) to any arbitrary depth (limited only by host memory). It also correctly handles quoted strings inside the "__(...)" and ignores any parentheses that may appear inside these quoted strings. Good luck.
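To tie this back to the question, here is a rough sketch that applies the same recursive pattern (without the trailing ".*") to the sample calls from the question, inlined instead of read from a file. The two helper patterns $reFirst and $reLast are not part of the script above; they are hypothetical additions that assume the message is the first string literal and the i18n file name, when present, is a string literal in last position.

```php
<?php
// Recursive pattern from the script above, minus the trailing ".*",
// so each match ends at the closing bracket of __( ... ).
$re = '/
    __\(
    (                                            # Group 1: bracket contents
      (?:
        [^()"\']++                               # non-brackets, non-quotes
      | "[^"\\\\]*(?:\\\\[\S\s][^"\\\\]*)*"      # double quoted string
      | \'[^\'\\\\]*(?:\\\\[\S\s][^\'\\\\]*)*\'  # single quoted string
      | \( (?1) \)                               # nested bracket (recurse)
      )*
    )
    \)
/x';

// Sample calls taken from the question, including the multi-line one.
$data = <<<'TXT'
__('Show in English')
__("Show in English", array(), 'page')
__("View Mary's Car", array())
__('at').' '.function($param)
echo __('title_in_place', array(
    '%title%' => $place['title']
), 'welcome-user'); ?>
TXT;

// Hypothetical helpers (assumptions, not from the original answer):
// a string literal at the start of the argument list ...
$reFirst = '/^\s*(["\'])((?:\\\\.|(?!\1).)*)\1/s';
// ... and a string literal that is the last argument.
$reLast = '/,\s*(["\'])((?:\\\\.|(?!\1).)*)\1\s*$/s';

$found = [];
preg_match_all($re, $data, $matches);
foreach ($matches[1] as $args) {
    $message = preg_match($reFirst, $args, $s) ? $s[2] : null;
    $domain  = preg_match($reLast, $args, $d) ? $d[2] : null;
    $found[] = [$message, $domain];
    printf("%s => %s\n", $message, $domain ?? '(none)');
}
```

Splitting the work in two steps keeps the recursive pattern responsible only for finding where each __() call ends; pulling the arguments apart is then done on the much smaller captured text. The helper patterns are deliberately simple and would need hardening for calls whose last argument is something other than a plain string literal.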