php regex xpath domdocument domxpath

Extract portion of HTML document - need to include xHTML markup


I have a situation where I need to extract a portion of an XHTML page, including the markup.

A regex in this case is not the correct route, as I am not guaranteed the exact number of child divs.

<div id="myDiv">
    <div><p>This is some content</p></div>
    <div><p>This additional content</p></div>
</div>

So, in the above snippet, I need to extract the <div><p>This is some content</p></div>, which includes the markup.

I've looked into XPath, and it seems to be the way to get this done, but I'm not certain how to get it to return not only the values of the nodes but also all of the associated markup.


Solution

  • You are correct, and this can be achieved through DOMDocument and XPath like so:

    $doc = new DOMDocument();
    $doc->loadHTML($html); // Load the HTML snippet

    $xpath = new DOMXPath($doc);
    $node = $xpath->query('//div[@id="myDiv"]/div')->item(0); // Get the first child <div>

    // Export that node, markup included (the node argument requires PHP 5.3.6+)
    $saved_node = $doc->saveHTML($node);
    

    In the output, you can see the desired string, including markup:

    string(62) "<div><p>This is some content</p></div>" 
    

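    Note that I had to run the output through htmlentities() so you would see the <div> without viewing the source of the page. That is also why var_dump() reports string(62): the raw markup is 38 characters long, and each escaped angle bracket adds three more.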
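
Since the question mentions that the number of child divs is not known in advance, here is a minimal self-contained sketch of the same approach that collects the markup of every child div rather than only the first. The inline sample markup and the `$fragments` array are illustrative assumptions, not part of the original answer.

```php
<?php
// Sample markup copied from the question; a real page would come from a file or an HTTP response.
$html = <<<HTML
<div id="myDiv">
    <div><p>This is some content</p></div>
    <div><p>This additional content</p></div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Collect the serialized markup of each direct child <div> of #myDiv,
// however many there happen to be.
$fragments = [];
foreach ($xpath->query('//div[@id="myDiv"]/div') as $node) {
    $fragments[] = $doc->saveHTML($node); // node argument requires PHP 5.3.6+
}

// Escape the markup only when displaying it inside an HTML page.
foreach ($fragments as $fragment) {
    echo htmlentities($fragment), PHP_EOL;
}
```

Keeping the raw strings in `$fragments` and escaping only at display time means the extracted markup can still be reused as HTML elsewhere.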