How can I adapt my regex to allow for escaped quotes?


Introduction

My general issue is that I want to replace question marks in a string, but only where they are not inside quotes. I found a similar answer on SO (link) and began testing the code. Unfortunately, that code does not take escaped quotes into account.

For example: $string = 'hello="is it me your are looking for\\"?" AND test=?';

I adapted the regular expression and code from the answer to the question "How to replace words outside double and single quotes". The original is reproduced here for ease of reading:

<?php
function str_replace_outside_quotes($replace,$with,$string){
    $result = "";
    $outside = preg_split('/("[^"]*"|\'[^\']*\')/',$string,-1,PREG_SPLIT_DELIM_CAPTURE);
    while ($outside)
        $result .= str_replace($replace,$with,array_shift($outside)).array_shift($outside);
    return $result;
}
?>

Actual issue

I attempted to adjust the pattern so that, between the double quotes, it matches anything that is not a quote (") as well as quotes that are escaped (\"):

<?php
$pattern = '/("(\\"|[^"])*"' . '|' . "'[^']*')/";

// when parsed/echoed by PHP the pattern evaluates to
// /("(\"|[^"])*"|'[^']*')/
?>

But this does not work as I had hoped.

My test string is: hello="is it me your are looking for\"?" AND test=?

And I am getting the following matches:

array
  0 => string 'hello=' (length=6)
  1 => string '"is it me your are looking for\"?"' (length=34)
  2 => string '?' (length=1)
  3 => string ' AND test=?' (length=11)

Match index two should not be there. That question mark should be considered part of match index 1 only and not repeated separately.

Once resolved, the same fix should also apply to the other side of the main alternation, which handles single quotes/apostrophes (').

After this is parsed by the complete function it should output:

echo str_replace_outside_quotes('?', '%s', 'hello="is it me your are looking for\\"?" AND test=?');
// hello="is it me your are looking for\"?" AND test=%s

I hope that this makes sense and I have provided enough information to answer the question. If not I will happily provide whatever you need.
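
A note on where the extra element probably comes from. With PREG_SPLIT_DELIM_CAPTURE, preg_split() returns the text captured by every parenthesized group in the delimiter pattern, so the inner group (\"|[^"]) contributes its last repetition (here the ?) as a separate element. In addition, \\" inside a single-quoted PHP string reaches the regex engine as \", which matches a bare quote rather than a backslash followed by a quote. The sketch below is one possible adjustment, not the code from the linked answer. It makes the inner groups non-capturing and consumes a backslash together with the character that follows it. The function name replace_outside_quotes_sketch is hypothetical, and the pattern assumes that backslash is the only escape character.

```php
<?php
// Sketch only: hypothetical helper name; assumes backslash is the sole escape character.
function replace_outside_quotes_sketch($replace, $with, $string) {
    // One capturing group around the whole quoted chunk; the inner groups are
    // non-capturing (?:...) so they add no extra elements to the split result.
    // '\\\\' in a single-quoted PHP string becomes \\ in the regex, i.e. one
    // literal backslash; "\\\\." therefore consumes a backslash plus the next char.
    $pattern = '/("(?:\\\\.|[^"\\\\])*"|\'(?:\\\\.|[^\'\\\\])*\')/s';
    $parts = preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);

    $result = '';
    foreach ($parts as $i => $part) {
        // Even indexes hold unquoted text; odd indexes hold the captured quoted chunks.
        $result .= ($i % 2 === 0) ? str_replace($replace, $with, $part) : $part;
    }
    return $result;
}

echo replace_outside_quotes_sketch('?', '%s', 'hello="is it me your are looking for\\"?" AND test=?'), "\n";
// hello="is it me your are looking for\"?" AND test=%s
```

Because only the outer group captures, the split result strictly alternates between unquoted text and quoted chunks, which is what the original while/array_shift loop relies on as well.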

Debug code

My current (complete) code sample is on codepad for forking as well:

function str_replace_outside_quotes($replace, $with, $string){
    $result = '';
    var_dump($string);
    $pattern = '/("(\\"|[^"])*"' . '|' . "'[^']*')/";
    var_dump($pattern);
    $outside = preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    var_dump($outside);
    while ($outside) {
        $result .= str_replace($replace, $with, array_shift($outside)) . array_shift($outside);
    }
    return $result;
}
echo str_replace_outside_quotes('?', '%s', 'hello="is it me your are looking for\\"?" AND test=?');

Sample input and expected output

In: hello="is it me your are looking for\\"?" AND test=? AND hello='is it me your are looking for\\'?' AND test=? hello="is it me your are looking for\\"?" AND test=? AND hello='is it me your are looking for\\'?' AND test=?
Out: hello="is it me your are looking for\\"?" AND test=%s AND hello='is it me your are looking for\\'?' AND test=%s hello="is it me your are looking for\\"?" AND test=%s AND hello='is it me your are looking for\\'?' AND test=%s

In: my_var = ? AND var_test = "phoned?" AND story = 'he said \'where is it?!?\''
Out: my_var = %s AND var_test = "phoned?" AND story = 'he said \'where is it?!?\''

Solution

  • The following tested script first checks that the given string is valid, meaning that it consists solely of single-quoted, double-quoted and unquoted chunks. The $re_valid regex performs this validation. If the string is valid, the script then parses it one chunk at a time using preg_replace_callback() and the $re_parse regex. The callback function processes the unquoted chunks using preg_replace() and returns all quoted chunks unaltered. The only tricky part of the logic is passing the $replace and $with argument values from the main function to the callback function, which is a bit awkward in procedural PHP. Here is the script:

    <?php // test.php Rev:20121113_1500
    function str_replace_outside_quotes($replace, $with, $string){
        $re_valid = '/
            # Validate string having embedded quoted substrings.
            ^                           # Anchor to start of string.
            (?:                         # Zero or more string chunks.
              "[^"\\\\]*(?:\\\\.[^"\\\\]*)*"  # Either a double quoted chunk,
            | \'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'  # or a single quoted chunk,
            | [^\'"\\\\]+               # or an unquoted chunk (no escapes).
            )*                          # Zero or more string chunks.
            \z                          # Anchor to end of string.
            /sx';
        if (!preg_match($re_valid, $string)) // Exit if string is invalid.
            exit("Error! String not valid.");
        $re_parse = '/
            # Match one chunk of a valid string having embedded quoted substrings.
              (                         # Either $1: Quoted chunk.
                "[^"\\\\]*(?:\\\\.[^"\\\\]*)*"  # Either a double quoted chunk,
              | \'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'  # or a single quoted chunk.
              )                         # End $1: Quoted chunk.
            | ([^\'"\\\\]+)             # or $2: an unquoted chunk (no escapes).
            /sx';
        _cb(null, $replace, $with); // Pass args to callback func.
        return preg_replace_callback($re_parse, '_cb', $string);
    }
    function _cb($matches, $replace = null, $with = null) {
        // Only set local static vars on first call.
        static $_replace, $_with;
        if (!isset($matches)) { 
            $_replace = $replace;
            $_with = $with;
            return; // First call is done.
        }
        // Return quoted string chunks (in group $1) unaltered.
        if ($matches[1]) return $matches[1];
        // Process only unquoted chunks (in group $2).
        return preg_replace('/'. preg_quote($_replace, '/') .'/',
            $_with, $matches[2]);
    }
    $data = file_get_contents('testdata.txt');
    $output = str_replace_outside_quotes('?', '%s', $data);
    file_put_contents('testdata_out.txt', $output);
    ?>
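
The awkward argument passing described above can be avoided on PHP 5.3 or later by using an anonymous function that captures $replace and $with with a use clause. What follows is a sketch of that variant, not the tested script. The function name str_replace_outside_quotes_closure is hypothetical. It reuses the $re_parse pattern, skips the $re_valid check for brevity, and uses str_replace() instead of preg_replace() with preg_quote(), on the assumption that the search text is always literal.

```php
<?php
// Sketch only: hypothetical function name; requires PHP 5.3+ for closures.
function str_replace_outside_quotes_closure($replace, $with, $string) {
    $re_parse = '/
          (                                       # $1: quoted chunk.
            "[^"\\\\]*(?:\\\\.[^"\\\\]*)*"        # Double quoted chunk,
          | \'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'    # or single quoted chunk.
          )
        | ([^\'"\\\\]+)                           # $2: unquoted chunk.
        /sx';

    return preg_replace_callback(
        $re_parse,
        // The closure captures $replace and $with directly, so no static
        // variables or priming call are needed.
        function ($matches) use ($replace, $with) {
            if ($matches[1] !== '') {
                return $matches[1];               // Quoted chunk: return unaltered.
            }
            return str_replace($replace, $with, $matches[2]);
        },
        $string
    );
}

echo str_replace_outside_quotes_closure('?', '%s', 'hello="is it me your are looking for\\"?" AND test=?'), "\n";
// hello="is it me your are looking for\"?" AND test=%s
```

Since validation is skipped here, anything the pattern cannot match (for example a stray backslash outside quotes, or an unbalanced quote character) is copied through unchanged rather than reported as an error.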