php regex preg-split

String to array, split by single and double quotes, ignoring escaped quotes


I have another PHP preg_split question, very similar to my last one, although I fear the solution will be quite a bit more complicated. As before, I'm trying to use PHP to split a string into array components using either " or ' as the delimiter. In addition, I would like to ignore escaped single quotes within the string (escaped double quotes will not occur, so there is no need to handle them). All of the examples from my last question remain valid, but the following two results should also be obtained:

$pattern = "?????";
$str = "the 'cat\'s dad sat on' the mat then \"fell 'sideways' off\" the mat";
$res = preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($res);
/*output:
Array
(
    [0] => the 
    [1] => 'cat\'s dad sat on'
    [2] =>  the mat then
    [3] => "fell 'sideways' off"
    [4] =>  the mat
)*/

$str = "the \"cat\'s dad\" sat on 'the \"cat\'s\" own' mat";
$res = preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($res);
/*output:
Array
(
    [0] => the 
    [1] => "cat\'s dad"
    [2] =>  sat on
    [3] => 'the "cat\'s" own'
    [4] =>  mat
)*/

@mcrumley's answer to my previous question worked well if there were no escaped quotations:

$pattern = "/('[^']*'|\"[^\"]*\")/U";

However, as soon as an escaped single quote appears, the regex treats it as the end of the match, which is not what I want.

I have tried something like this:

$pattern = "/('(?<=(?!\\').*)'|\"(?<=(?!\\').*)\")/";

But it's not working. Unfortunately, my knowledge of lookarounds is not good enough for this.

After some reading and fiddling, this seems closer:

$pattern = "/('(?:(?!\\').*)')|(\"(?:(?!\\'|').*)\")/";

But the greediness is wrong, and it does not produce the outputs above.


Solution

  • Try this:

    $pattern = "/(?<!\\\\)('(?:\\\\'|[^'])*'|\"(?:\\\\\"|[^\"])*\")/";
                 ^^^^^^^^^  ^^^^^^^^^    ^     ^^^^^^^^^^     ^
    

    Demo at http://rubular.com/r/Eps2mx8KCw.
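
For a quick local check, here is a minimal sketch that runs the pattern above against the two strings from the question. The variable names `$inputs` and `$results` are just for illustration, and `-1` is passed as the limit, which means "no limit".

```php
<?php
// Pattern from the answer: a quoted span may contain backslash-escaped
// quotes of its own kind; the lookbehind rejects an opening quote that
// is itself preceded by a backslash.
$pattern = "/(?<!\\\\)('(?:\\\\'|[^'])*'|\"(?:\\\\\"|[^\"])*\")/";

// The two test strings from the question. Inside a double-quoted PHP
// string, \' is not an escape sequence, so the backslash is kept.
$inputs = [
    "the 'cat\'s dad sat on' the mat then \"fell 'sideways' off\" the mat",
    "the \"cat\'s dad\" sat on 'the \"cat\'s\" own' mat",
];

$results = [];
foreach ($inputs as $str) {
    // PREG_SPLIT_DELIM_CAPTURE keeps the quoted spans in the result.
    $results[] = preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE);
}

print_r($results);
```

Each result should contain five pieces, with the quoted spans (quotes included) at indexes 1 and 3. Note that the unquoted pieces keep their surrounding spaces.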

    You can also collapse that into a unified expression using back-references:

    $pattern = "/(?<!\\\\)((['\"])(?:\\\\\\2|(?!\\2).)*\\2)/";
    

    Demo at http://rubular.com/r/NLZKyr9xLk.
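
One thing to watch with the back-reference version: it has two capture groups, and PREG_SPLIT_DELIM_CAPTURE returns every captured group, so the bare quote character (group 2) shows up as an extra array element after each quoted span. A rough sketch of one way to drop it, assuming the pieces always repeat as text, quoted span, quote character (`$raw` and `$res` are illustrative names):

```php
<?php
// Unified pattern from the answer: group 1 is the whole quoted span,
// group 2 is the opening quote character used by the back-references.
$pattern = "/(?<!\\\\)((['\"])(?:\\\\\\2|(?!\\2).)*\\2)/";
$str = "the 'cat\'s dad sat on' the mat then \"fell 'sideways' off\" the mat";

$raw = preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE);

// $raw repeats as: text, group 1, group 2, text, group 1, group 2, ...
// Keep everything except the group 2 entries (indexes 2, 5, 8, ...).
$res = [];
foreach ($raw as $i => $piece) {
    if ($i % 3 !== 2) {
        $res[] = $piece;
    }
}

print_r($res);
```

This only holds when PREG_SPLIT_NO_EMPTY is not set, since that flag would remove empty pieces and break the fixed three-element rhythm.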

    Neither pattern works if you also want escaped backslashes to be recognized in your text (for example, a quoted span that ends in an escaped backslash), but I doubt that's a scenario you need to account for.
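
If escaped backslashes do matter, one possible variant (not from the original answer, so treat it as a sketch) is to consume a backslash together with whatever character follows it. This assumes backslash escapes only ever appear inside quoted spans, so the lookbehind is dropped; an escaped quote outside any quoted span would not be handled.

```php
<?php
// Inside a quoted span, match either a character that is neither the
// closing quote nor a backslash, or a backslash plus the next character.
// The s modifier lets "." also match a newline after a backslash.
$pattern = "/('(?:[^'\\\\]|\\\\.)*'|\"(?:[^\"\\\\]|\\\\.)*\")/s";

// Literal text of the test string: x 'a\\' b 'it\'s' y
$str = "x 'a\\\\' b 'it\\'s' y";

$res = preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($res);
```

Here the first quoted span ends right after the escaped backslash, and the escaped quote in the second span is still skipped.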