Tags: php, regex, preg-split

Preg_split on quote, except when followed by another quote


I'm trying to split a UTF-8 string on a quote character (") with delimiter capture, except where that quote is followed by a second quote (""), so that, for example,

"A ""B"" C" & "D ""E"" F"

will split into three elements

"A ""B"" C"
&
"D ""E"" F"

I've been attempting to use:

$string = '"A ""B"" C" & "D ""E"" F"';
$temp = preg_split(
    '/"[^"]/mui',
    $string,
    -1, // no limit (passing null here is deprecated as of PHP 8.1)
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

but without success, as it gives me:

array(7) {
  [0]=>
  string(2) " ""
  [1]=>
  string(1) """
  [2]=>
  string(1) "C"
  [3]=>
  string(2) "& "
  [4]=>
  string(2) " ""
  [5]=>
  string(1) """
  [6]=>
  string(2) "F""
}

So it's losing any character that immediately follows a quote, unless that character is also a quote.
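
The lost characters can be made visible by listing what the pattern matches, since preg_split() discards every matched substring. The following is only a minimal sketch using the same sample string; the m and i modifiers are left out because they have no effect on this pattern:

```php
<?php
// Same sample string as in the question.
$string = '"A ""B"" C" & "D ""E"" F"';

// List everything the pattern matches; preg_split() drops exactly these substrings.
preg_match_all('/"[^"]/u', $string, $matches);

// Each match is two characters long: the quote plus the character after it.
var_dump($matches[0]);
```

Every match includes the character after the quote because [^"] consumes it, which is why A, B, D, E and some of the spaces are missing from the split result.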

In this example there's a quote as the first and last characters in the string, though that may not always be the case, e.g.

{ "A ""B"" C" & "D ""E"" F" }

needs to split into five elements

{
"A ""B"" C"
&
"D ""E"" F"
}

Can anybody help me get this working?


Solution

  • Since you said that you don't mind the quotes being consumed by the split, you can use the expression:

    (?<!")\s?"\s?(?!")
    

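    It uses two negative lookarounds: (?<!") rejects a quote that is preceded by another quote, and (?!") rejects a quote that is followed by another quote. The output on your sample will be: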

    {
    A ""B"" C
    &
    D ""E"" F
    }
    

    [I added the \s? on each side to consume one whitespace character around the quote; remove them if you want to keep that whitespace]
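
For reference, here is a minimal sketch of how the expression above could be plugged into preg_split(). The variable names and the choice of flags are assumptions made for illustration; PREG_SPLIT_DELIM_CAPTURE is dropped because the pattern has no capture group, so it would have no effect.

```php
<?php
// Sample input from the question, with braces around the quoted parts.
$string = '{ "A ""B"" C" & "D ""E"" F" }';

// Split on a quote that is neither preceded nor followed by another quote,
// also consuming one optional whitespace character on each side of it.
$parts = preg_split(
    '/(?<!")\s?"\s?(?!")/u',
    $string,
    -1,                  // no limit
    PREG_SPLIT_NO_EMPTY  // drop empty pieces at the start or end
);

var_dump($parts);
// Pieces: {   A ""B"" C   &   D ""E"" F   }
```

One caveat with this approach: a doubled quote that sits directly next to a delimiting quote, as in """B"" C", leaves three quotes in a row, and none of them passes both lookarounds, so that delimiter is not split on.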