php regex preg-split

preg_split with two patterns (one of them quoted)


I would like to split a string in PHP containing quoted and unquoted substrings.
Let's say I have the following string:

"this is a string" cat dog "cow"  

The resulting array should look like this:

array (  
[0] => "this is a string"  
[1] => "cat"  
[2] => "dog"  
[3] => "cow"  
)

I'm struggling a bit with regex, and I'm wondering whether this is even possible with a single regex/preg_split call.

The first thing I tried was:

[[:blank:]]*(?=(?:[^"]*"[^"]*")*[^"]*$)[[:blank:]]*

But this only splits array[0] and array[3] correctly; the rest is split character by character.

Then I found this link:
PHP preg_split with two delimiters unless a delimiter is within quotes

(?=(?:[^"]*"[^"]*")*[^"]*$)

This seems like a good starting point to me. However, the result for my example is the same as with the first regex.

I tried combining both: first the one for quoted strings, then a second sub-regex that should omit quoted strings (hence the [^"]):

(?=(?:[^"]*"[^"]*")*[^"]*$)|[[:blank:]]*([^"].*[^"])[[:blank:]]*

So, two questions:

  1. Is it even possible to achieve what I want with a single regex/preg_split call?
  2. If yes, I would appreciate a hint on how to assemble the regex correctly.

Solution

  • Since matches cannot overlap, you could use preg_match_all like this:

    preg_match_all('/"[^"]*"|\S+/', $input, $matches);
    

    Now $matches[0] should contain what you are looking for. The regex first tries to match a quoted string; if that fails at the current position, it collects as many non-whitespace characters as possible instead. Since alternatives are tried from left to right, the quoted version takes precedence.
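As a quick check of the behaviour described above, here is a minimal sketch that runs the pattern against the sample string from the question. The variable name `$tokens` is introduced here for illustration only.

```php
<?php
// Sample input taken from the question.
$input = '"this is a string" cat dog "cow"';

// The quoted alternative is listed first, so it is tried before \S+.
preg_match_all('/"[^"]*"|\S+/', $input, $matches);

// $matches[0] holds the full matches, quotes included.
$tokens = $matches[0];
print_r($tokens);
```

The quoted tokens keep their surrounding quotes here, while `cat` and `dog` come back as bare words.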

    EDIT: This will not get rid of the quotes though. To do this, you could use capturing groups:

    preg_match_all('/(?|"([^"]*)"|(\S+))/', $input, $matches);
    

    Now $matches[1] will contain exactly what you are looking for. The (?| opens a branch-reset group, so the capturing group in each alternative gets the same index.
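To make the effect of the capturing groups concrete, the following sketch compares `$matches[0]` and `$matches[1]` for the sample string from the question. The names `$full` and `$unquoted` are illustrative, not part of the original answer.

```php
<?php
// Sample input taken from the question.
$input = '"this is a string" cat dog "cow"';

// Inside (?| ... ) each alternative's capture group is numbered 1.
preg_match_all('/(?|"([^"]*)"|(\S+))/', $input, $matches);

$full     = $matches[0]; // whole matches, quotes still present
$unquoted = $matches[1]; // group 1: contents without the quotes
print_r($unquoted);
```

Without the `(?|`, the two groups would be numbered 1 and 2, and the results would be spread across `$matches[1]` and `$matches[2]`.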

    EDIT 2: Since you were asking for a preg_split solution, that is also possible. We can use a lookahead that asserts that the whitespace is followed by an even number of quotes (up until the end of the string). The trailing [^"]* allows the string to end with unquoted text:

    $result = preg_split('/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/', $input);
    

    Of course, this will not get rid of the quotes, but that can easily be done in a separate step.
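One way the separate quote-stripping step might look is sketched below. The helper name `split_quoted` is hypothetical, and the `PREG_SPLIT_NO_EMPTY` flag is an addition that drops empty pieces caused by leading or trailing whitespace. The pattern here includes a trailing `[^"]*` before `$` so that input ending in an unquoted word is split as well.

```php
<?php
// Hypothetical helper, not part of the original answer.
function split_quoted(string $input): array
{
    // Split on whitespace that is followed by an even number of quotes,
    // i.e. whitespace that lies outside any quoted substring.
    $parts = preg_split(
        '/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/',
        $input,
        -1,
        PREG_SPLIT_NO_EMPTY // assumption: empty pieces are not wanted
    );

    // Separate step: strip the surrounding double quotes from each piece.
    return array_map(function ($part) {
        return trim($part, '"');
    }, $parts);
}

print_r(split_quoted('"this is a string" cat dog "cow"'));
```

Like the other patterns in this answer, this sketch assumes the quotes are balanced and does not handle escaped quotes inside a quoted substring.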