php html web-scraping dom

Extract plain text inside a div that is not wrapped in any tags, when the div also contains other DOM elements


I'm trying to extract some plain text that isn't wrapped in any tags using PHP. The best way to explain is to show an example:

<div>
    <span>Hello</span>
        THIS IS THE TEXT I WANT TO EXTRACT
    <span>this is some other text</span>
    <div><span>pow</span></div>
</div>

What I'm about to try is to loop through the div and remove all the DOM elements inside it, which should leave just the text. But I'm hoping there's a more elegant method.


Solution

  • If I am reading your question correctly, you want to get the text for the element, but excluding the text for child elements.

    Using JavaScript (with jQuery), there is a solution for that here:

    http://www.stevefenton.co.uk/Content/Blog/Date/201007/Blog/Jquery-Get-Text-While-Excluding-Children/

    And in summary, you would do this...

    $("#mydiv").clone().children().remove().end().text();
    

    In PHP (using phpQuery) this would be...

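    // DOMinnerHTML() is not part of phpQuery; it is assumed here to be a user-defined
    // helper that returns the inner HTML of your DOM element as a string.
    $phpqueryObj = phpQuery::newDocument(DOMinnerHTML($INNERHTMLOFYOURDOMELEMENT));
    // Clone, remove the child elements, step back to the clone, then read the remaining text.
    $text = $phpqueryObj->clone()->children()->remove()->end()->text();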
    

    Without jQuery / JavaScript you would have to perform a similar process manually, i.e. remove the child elements from a cloned version of the element and then get the inner text.
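
The manual process described above could be sketched in plain PHP with the built-in DOM extension, roughly as follows. This is only a sketch under a few assumptions: the function name `textExcludingChildren` is made up for illustration, the markup is the sample from the question, and the target is taken to be the first `div` in the document.

```php
<?php
// Hypothetical helper (the name is an assumption, not from any library):
// returns the text directly inside $element, ignoring text in child elements.
function textExcludingChildren(DOMElement $element): string
{
    // Work on a deep copy so the original document is left untouched.
    $clone = $element->cloneNode(true);

    // childNodes is a live list, so copy it to an array before removing nodes.
    foreach (iterator_to_array($clone->childNodes) as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $clone->removeChild($child);
        }
    }

    // Only the bare text nodes remain; trim the surrounding whitespace.
    return trim($clone->textContent);
}

// Sample markup from the question.
$html = <<<HTML
<div>
    <span>Hello</span>
        THIS IS THE TEXT I WANT TO EXTRACT
    <span>this is some other text</span>
    <div><span>pow</span></div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);

// Assumption: the element of interest is the first div in the document.
$outerDiv = $doc->getElementsByTagName('div')->item(0);

echo textExcludingChildren($outerDiv), PHP_EOL;
// prints: THIS IS THE TEXT I WANT TO EXTRACT
```

If the element holds several separate runs of bare text, `textContent` joins them with whatever whitespace sat between them, so an alternative is to loop over `childNodes` and collect only the nodes whose `nodeType` is `XML_TEXT_NODE`, which avoids cloning altogether.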