javascript php regex

Adjusting Regular Expression to Extract PHP Variables with Parentheses


I have a JavaScript function designed to extract PHP variables from a given string of PHP code. It works well for most cases, but I'm encountering an issue when the variables are enclosed within parentheses. Here's the current implementation:

const extractPhpVariables = (data) => {
    const phpVariables = {};
    // Variable extraction
    const phpVariableRegex = /<\?(?:\s*=\s*|\s*php\s+echo\s+|\s*=\s*|\s*php\s+)?(?:\(\$)?(\$[\w-]+)(?:\)|;)?\s*\?>/g;
    let match;
    while ((match = phpVariableRegex.exec(data)) !== null) {
        const [fullMatch, variableName] = match;
        phpVariables[fullMatch] = { variableName };
    }
    return phpVariables;
};
 
// Example PHP code
const phpCode = `
<?php echo $var2; ?>
<?= $var3; ?>
<?php echo ($mdname); ?>
`;
 
// Call the function
const extractedVariables = extractPhpVariables(phpCode);
console.log(extractedVariables);

The output of this function is:

{
  '<?php echo $var2; ?>': { variableName: '$var2' },
  '<?= $var3; ?>': { variableName: '$var3' }
}

However, I want it to also capture variables enclosed within parentheses, like $mdname in <?php echo ($mdname); ?>. How can I adjust the regular expression to achieve this? I want the output to be:

{
  '<?php echo $var2; ?>': { variableName: '$var2' },
  '<?= $var3; ?>': { variableName: '$var3' },
  '<?php echo ($mdname); ?>': { variableName: '$mdname' }
}


Solution

  • Personal thoughts about the way of solving your problem

    To be honest, I don't think parsing PHP with JS and RegExp is the best choice. But I imagine you are doing this during an npm build step to list all the PHP variables printed in some templates.

    It would certainly be better to use a template engine such as Twig, Blade, Plates, Mustache, Smarty, Volt, etc. You would then most likely have access to parser functionality that lists the variables used in your templates.

    But well, in some projects, templates are used in the "old school" way, with the basic <?php echo $var; ?>, like you mentioned.

    Attempt with a regex

    You'll have several cases, for sure, so I just listed a few of them in my example.

    I would first list all your PHP tags and log what they contain. This will help you draw up a list of cases, then build a set of regular expressions to extract what you need, and maybe even pass the tag contents to a PHP tokenizer, if you can do that.
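    As a rough illustration of that first step, here is a minimal sketch that collects the body of every PHP tag so the cases can be inspected by eye. The helper name listPhpTags and the sample template are made up for this example, and the sketch assumes that no tag contains "?>" inside a string literal.

```javascript
// Hypothetical helper: collect the body of every PHP tag in a template.
// Assumption: every tag is closed with ?> and no string inside a tag contains "?>".
const listPhpTags = (template) => {
    // Opener is <?=, <?php (any case) or a bare <?, then the shortest body up to ?>.
    const tagRegex = /<\?(?:=|php\b)?([\s\S]*?)\?>/gi;
    return Array.from(template.matchAll(tagRegex), (m) => m[1].trim());
};

// Made-up sample, reusing tags that appear in this thread.
const sample = `
<?php echo $var2; ?>
<?= $var3; ?>
<?php echo ($mdname); ?>
<?php printf("Hello %s", $user->name); ?>
`;

const tagBodies = listPhpTags(sample);
console.log(tagBodies);
```

    Logging the bodies this way makes it easier to see which shapes (bare variable, parentheses, function call, string) actually occur before writing a pattern for each.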

    Coming to the regular expression, I'll use PCRE's extended notation to make the pattern easier to read:

    /
    <\?
    # Both syntaxes: <?= or <?php, optionally followed by echo or print.
    (?: \s*=\s* | \s*php\s+(?:echo|print)?\s* )?
    # Several scenarios for the variable:
    (?:
      # Just a variable alone.
      (?<var_only>\$\w+)
    |
      # Potentially a function call, like trim(), or useless parentheses.
      # This will not handle functions with multiple parameters.
      (?:\w+\s*)?\(\s* (?<var_in_parenthesis>\$\w+) \s*\)
    |
      # Variable evaluated in a double-quoted string.
      "
      (?<string_content>
        (?:
          \\.     # Any escape char, such as \", \$ or \n.
        |
          (?<var_in_string>\$\w+) # A PHP variable.
        |
          [^"$]+  # Any other char which isn't " or $.
        )*
      )
      "
    )
    # Optional semicolon to put after the variable
    \s*;?\s*
    \?>
    /gxi
    

    Converted to JS, without comments:

    /<\?(?:\s*=\s*|\s*php\s+(?:echo|print)?\s*)?(?:(?<var_only>\$\w+)|(?:\w+\s*)?\(\s*(?<var_in_parenthesis>\$\w+)\s*\)|"(?<string_content>(?:\\.|(?<var_in_string>\$\w+)|[^"$]+)*)")\s*;?\s*\?>/gi
    

    And in execution here: https://regex101.com/r/2TUqSC/1
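    To tie this back to the original function, here is a sketch of extractPhpVariables with the pattern above swapped in. It only reads the var_only and var_in_parenthesis groups and leaves double-quoted strings aside. Note that the variable name is returned with its leading $, like the other entries.

```javascript
// Sketch only: the question's function with the pattern from this answer swapped in.
const extractPhpVariables = (data) => {
    const phpVariables = {};
    const phpVariableRegex = /<\?(?:\s*=\s*|\s*php\s+(?:echo|print)?\s*)?(?:(?<var_only>\$\w+)|(?:\w+\s*)?\(\s*(?<var_in_parenthesis>\$\w+)\s*\)|"(?<string_content>(?:\\.|(?<var_in_string>\$\w+)|[^"$]+)*)")\s*;?\s*\?>/gi;
    for (const match of data.matchAll(phpVariableRegex)) {
        const { var_only, var_in_parenthesis } = match.groups;
        // At most one of these two groups is set for a given tag.
        const variableName = var_only ?? var_in_parenthesis;
        // Tags that print a double-quoted string are skipped in this sketch.
        if (variableName !== undefined) {
            phpVariables[match[0]] = { variableName };
        }
    }
    return phpVariables;
};

// The three tags from the question.
const phpCode = `
<?php echo $var2; ?>
<?= $var3; ?>
<?php echo ($mdname); ?>
`;

const extracted = extractPhpVariables(phpCode);
console.log(extracted);
```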

    But, as there can be several variables inside a string, I simplified the regex a bit so that it just finds double-quoted strings; we can then search each string for multiple variables:

    // PCRE regex with comments, converted to JS, without comments, with the help of a
    // tool I made for that: https://codepen.io/patacra/pen/wvQBxjq
    // I then simplified the case of a print of a double-quoted string as we can
    // then search for multiple variables with a second regex.
    
    const regex = /<\?(?:\s*=\s*|\s*php\s+(?:echo|print)?\s*)?(?:(?<var_only>\$\w+)|(?:\w+\s*)?\(\s*(?<var_in_parenthesis>\$\w+)\s*\)|"(?<string_content>(?:\\.|[^"]+)*)")\s*;?\s*\?>/gi;
    
    const templateCode = `<p>This is the content of <code>$var2</code>: <?php echo $var2; ?></p>
    
    <p>Other syntax for <code>$var3</code>: <?= $var3; ?></p>
    
    <p>Why not adding useless parenthesis? <?php echo ($mdname); ?></p>
    
    <h2><?PHP print trim($title); ?></h2>
    
    <p>Dear <?= "Mr. $lastname"; ?>, please don't do this analyze with JS and regex. Use a PHP tool for that.</p>
    
    <p><?php print "$user said \\"Hello!\\" to you"; ?></p>
    
    <table>
      <tr>
        <th>Price</th>
        <td><?= "$price\\$" ?></td>
      </tr>
    </table>
    
    <p>Multiple vars: <?php print "$firstname $lastname is \${size}cm"; ?> tall.</p>
    
    <h2>Not working cases and the reason to do it with a tokenizer</h2>
    
    <?php printf("Hello %s", $user->name); ?>
    <?php print round($number, 1); ?>
    <?php print 'Hello ' . $user_first_name; ?>`;
    
    let match;
    let variables = [];
    
    while ((match = regex.exec(templateCode)) !== null) {
        if (match.groups.var_only) {
            variables.push(match.groups.var_only);
        }
        else if (match.groups.var_in_parenthesis) {
            variables.push(match.groups.var_in_parenthesis);
        }
        else if (match.groups.string_content) {
            // To match "user" and "size" from "$user is ${size}cm tall".
            const regexVarInString = /(?<=\$)\w+|(?<=\$\{)\w+(?=\})/g;
            let subMatch;
            while ((subMatch = regexVarInString.exec(match.groups.string_content)) !== null) {
                variables.push('$' + subMatch[0]);
            }
        }
    }
    
    console.log("This is the HTML/PHP template file:\n" + templateCode + "\n");
    
    console.log("Found variables are:\n- " + variables.join("\n- "));

    As you can see in my example, this will not work for cases like these:

    <?php printf("Hello %s", $user->name); ?>
    <?php print round($number, 1); ?>
    <?php print 'Hello ' . $user_first_name; ?>
    

    This is the reason for my initial reservations. Look into a PHP tool instead!
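    If you stay with the regex anyway, it may help to at least know which tags it silently skips. The sketch below is one possible way to do that: it finds every PHP tag with a very loose pattern and reports those the stricter pattern cannot match. The names handled and findUnhandledTags are invented for this example, and it again assumes that no tag contains "?>" inside a string literal.

```javascript
// The simplified pattern from this answer, without the g flag so test() keeps no state.
const handled = /<\?(?:\s*=\s*|\s*php\s+(?:echo|print)?\s*)?(?:(?<var_only>\$\w+)|(?:\w+\s*)?\(\s*(?<var_in_parenthesis>\$\w+)\s*\)|"(?<string_content>(?:\\.|[^"]+)*)")\s*;?\s*\?>/i;

// Hypothetical helper: return every PHP tag that the pattern above cannot match.
// Assumption: no tag contains "?>" inside a string literal.
const findUnhandledTags = (template) => {
    const anyTag = /<\?[\s\S]*?\?>/g;
    return (template.match(anyTag) ?? []).filter((tag) => !handled.test(tag));
};

// One supported tag followed by the three unsupported cases listed above.
const template = `
<?php echo ($mdname); ?>
<?php printf("Hello %s", $user->name); ?>
<?php print round($number, 1); ?>
<?php print 'Hello ' . $user_first_name; ?>
`;

const unhandledTags = findUnhandledTags(template);
console.log(unhandledTags);
```

    A non-empty result is a hint that the template uses constructs the regex does not cover, which is where a tokenizer becomes the safer option.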

    PHP solution

    <?php
    
    $html_and_php_code = <<<'END_OF_TEMPLATE'
    <p>This is the content of <code>$var2</code>: <?php echo $var2; ?></p>
    
    <p>Other syntax for <code>$var3</code>: <?= $var3; ?></p>
    
    <p>Why not adding useless parenthesis? <?php echo ($mdname); ?></p>
    
    <h2><?PHP print trim($title); ?></h2>
    
    <p>Dear <?= "Mr. $lastname"; ?>, please don't do this analyze with JS and regex. Use a PHP tool for that.</p>
    
    <p><?php print "$user said \"Hello!\" to you"; ?></p>
    
    <table>
      <tr>
        <th>Price</th>
        <td><?= "$price\$" ?></td>
      </tr>
    </table>
    
    <p>Multiple vars: <?php print "$firstname $lastname is ${size}cm"; ?> tall.</p>
    
    <h2>Not working cases and the reason to do it with a tokenizer</h2>
    
    <?php printf("Hello %s", $user->name); ?>
    <?php print round($number, 1); ?>
    <?php print 'Hello ' . $user_first_name; ?>
    END_OF_TEMPLATE;
    
    $found_variables = [];
    $tokens = token_get_all($html_and_php_code);
    
    foreach ($tokens as $token) {
      if (is_array($token)) {
        list($id, $text) = $token;
    
        if ($id === T_VARIABLE) {
          $found_variables[] = $text;
        }
      }
    }
    
    print "Found variables:\n- " . implode("\n- ", $found_variables);
    

    Result:

    Found variables:
    - $var2
    - $var3
    - $mdname
    - $title
    - $lastname
    - $user
    - $price
    - $firstname
    - $lastname
    - $user
    - $number
    - $user_first_name
    

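    As you can see, it missed ${size}. The tokenizer does not report that form as T_VARIABLE: it emits T_DOLLAR_OPEN_CURLY_BRACES followed by T_STRING_VARNAME, so the loop would also need to collect T_STRING_VARNAME tokens to catch it. Note that this interpolation syntax is deprecated since PHP 8.2.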

    Test it online: https://onlinephp.io/c/9e473