i am using preg_split("/\"[^\"]*\"(*SKIP)(*F)|\x20/", $input_line);
and run it on phpliveregex.com
it produce array :
array(10
0=><b>test</b>
1=>or
2=><em>oh
3=>yeah</em>
4=>and
5=><i>
6=>oh
7=>yeah
8=></i>
9=>"ye we 'hold' it"
)
NOT what i want, it should be seperate by spaces only outside html tags like this:
array(5
0=><b>test</b>
1=>or
2=><em>oh yeah</em>
3=>and
4=><i>oh yeah</i>
5=>"ye we 'hold' it"
)
in this regex i am only can add exception in "double quote" but realy need help to add more, like tag <img/><a></a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>
any explanation about how that regex works also appreciate.
It's easier to use the DOMDocument
since you don't need to describe what a html tag is and how it looks. You only need to check the nodeType. When it's a textNode, split it with preg_match_all
(it's more handy than to design a pattern for preg_split
):
$html = 'spaces in a text node <b>test</b> or <em>oh yeah</em> and <i>oh yeah</i>
"ye we \'hold\' it"
"unclosed double quotes at the end';
$dom = new DOMDocument;
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);
$nodeList = $dom->documentElement->childNodes;
$results = [];
foreach ($nodeList as $childNode) {
if ($childNode->nodeType == XML_TEXT_NODE &&
preg_match_all('~[^\s"]+|"[^"]*"?~', $childNode->nodeValue, $m))
$results = array_merge($results, $m[0]);
else
$results[] = $dom->saveHTML($childNode);
}
print_r($results);
Note: I have chosen a default behaviour when a double quote part stays unclosed (without a closing quote), feel free to change it.
Note2: Sometimes LIBXML_
constants are not defined. You can solve this problem testing it before and defining it when needed:
if (!defined('LIBXML_HTML_NOIMPLIED'))
define('LIBXML_HTML_NOIMPLIED', 8192);