PHP preg_split on spaces, but not within tags


I am using preg_split("/\"[^\"]*\"(*SKIP)(*F)|\x20/", $input_line); and when I run it on phpliveregex.com it produces this array:

array(10
  0=><b>test</b>
  1=>or
  2=><em>oh
  3=>yeah</em>
  4=>and
  5=><i>
  6=>oh
  7=>yeah
  8=></i>
  9=>"ye we 'hold' it"
)

This is NOT what I want. It should split on spaces only outside HTML tags, like this:

array(6
  0=><b>test</b>
  1=>or
  2=><em>oh yeah</em>
  3=>and
  4=><i>oh yeah</i>
  5=>"ye we 'hold' it"
)

With this regex I can only add an exception for "double quotes", but I really need help adding more, such as the tags <img/>, <a></a>, <pre></pre>, <code></code>, <strong></strong>, <b></b>, <em></em> and <i></i>.

Any explanation of how that regex works would also be appreciated.


Solution

  • It's easier to use DOMDocument, since you don't need to describe what an HTML tag is or what it looks like. You only need to check each node's nodeType. When the node is a text node, split it with preg_match_all (which is handier than designing a pattern for preg_split); otherwise keep the node's HTML as a single item:

    $html = 'spaces in a text node <b>test</b> or <em>oh yeah</em> and <i>oh yeah</i>
    "ye we \'hold\' it"
    "unclosed double quotes at the end';
    
    $dom = new DOMDocument;
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);
    
    $nodeList = $dom->documentElement->childNodes;
    
    $results = [];
    
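    foreach ($nodeList as $childNode) {
        if ($childNode->nodeType === XML_TEXT_NODE) {
            // Text node: collect quoted parts and whitespace-free words.
            // A whitespace-only node yields no matches and is skipped.
            if (preg_match_all('~[^\s"]+|"[^"]*"?~', $childNode->nodeValue, $m)) {
                $results = array_merge($results, $m[0]);
            }
        } else {
            // Anything else (an element here): keep its HTML intact, inner spaces included.
            $results[] = $dom->saveHTML($childNode);
        }
    }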
    
    print_r($results);
    

    Note: I have chosen a default behaviour for a double-quoted part that is left unclosed (no closing quote): it is kept as a single item up to the end of the text node. Feel free to change it.
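
To see what the text-node pattern does on its own, including the unclosed-quote default mentioned in the note above, here is a minimal sketch. The sample string is made up for illustration and stands in for the nodeValue of a single text node.

```php
<?php
// Hypothetical sample text, standing in for one text node's nodeValue.
$text = "and \"ye we 'hold' it\" \"unclosed at the end";

// Alternative 1: [^\s"]+   a run of characters that are neither whitespace nor a quote.
// Alternative 2: "[^"]*"?  a quote, everything up to the next quote, and that quote if present.
// Whitespace between items matches neither alternative, so it is simply skipped.
preg_match_all('~[^\s"]+|"[^"]*"?~', $text, $m);

$tokens = $m[0];
print_r($tokens);
// $tokens holds three items: and / "ye we 'hold' it" / "unclosed at the end
```

Because the closing quote is optional ("?), the unclosed part swallows the rest of the text. Dropping the ? would instead leave the lone quote unmatched and let the words after it be returned one by one.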

    Note 2: Sometimes the LIBXML_ constants are not defined. You can work around this by checking first, before calling loadHTML, and defining the constant when needed:

    if (!defined('LIBXML_HTML_NOIMPLIED'))
        define('LIBXML_HTML_NOIMPLIED', 8192);
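
Since the question also asks how the original regex works, here is a rough sketch of the (*SKIP)(*F) idiom extended to tags, for comparison only. It is not the method recommended above. Each "protected" alternative matches something that must stay intact, then (*SKIP)(*F) forces the match to fail and tells the engine to resume searching after the consumed text. Only the last alternative, the space, can ever succeed, so only spaces outside the protected parts act as delimiters. The tag alternatives are my own assumption about what counts as a tag, and the input string is modelled on the question.

```php
<?php
// Hypothetical input modelled on the question.
$input = '<b>test</b> or <em>oh yeah</em> and <i>oh yeah</i> "ye we \'hold\' it"';

// x flag: whitespace and # comments inside the pattern are ignored.
// s flag: the dot also matches newlines inside an element.
$pattern = '~
      "[^"]*"               (*SKIP)(*F)   # quoted part: consume it, then discard the match
    | <(\w+)[^>]*>.*?</\1>  (*SKIP)(*F)   # element, from <tag> to the first </tag>: same
    | <[^>]+>               (*SKIP)(*F)   # any other single tag such as <img/>: same
    | \x20                                # a space outside all of the above: the delimiter
~xs';

$parts = preg_split($pattern, $input);
print_r($parts);
// $parts holds six items: <b>test</b> / or / <em>oh yeah</em> / and / <i>oh yeah</i> / "ye we 'hold' it"
```

This sketch stops at the first matching closing tag, so nested elements with the same name (for example a <b> inside a <b>) are split incorrectly, which is the kind of case the DOMDocument approach handles without extra work.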