
How to get all TEXT outside elements in an HTML document


I'm using Symfony DomCrawler to get all text in a document.

$this->crawler->filter('p')->each(function (Crawler $node, $i) {
    // process text
});

I'm trying to gather all text within the <body> that is not inside a child element.

<body>
    This is an example
    <p>
        blablabla
    </p>
    another example
    <p>
        <span>Yo!</span>
        again, another piece of text <br/>
        with an annoy BR in the middle
    </p>
</body>

I'm using PHP Symfony and can use XPath (preferred) or RegEx.


Solution

  • The string value of the entire document can be obtained with this simple XPath:

    string(/)
    

    All text nodes in the document would be:

    //text()
    

    The immediate text node children of body, which is the text the question asks for, would be (assuming the usual html root element):

    /html/body/text()
    

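    Note that these XPaths select text nodes, not strings, so what you get back depends on context. In XPath 1.0, converting a node-set to a string (for example with string()) yields only the value of the first node, so iterate over the selected nodes if you want every piece of text.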
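
    As a rough illustration of the last XPath, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath rather than DomCrawler (an assumption made so the example has no dependencies). The sample markup is the one from the question, wrapped in an html element; the [normalize-space()] predicate and the trim() call are additions that drop whitespace-only nodes and surrounding indentation, and the variable names are illustrative.

```php
<?php
// Markup from the question, wrapped in <html> so that /html/body exists.
$html = <<<'HTML'
<html><body>
    This is an example
    <p>
        blablabla
    </p>
    another example
    <p>
        <span>Yo!</span>
        again, another piece of text <br/>
        with an annoy BR in the middle
    </p>
</body></html>
HTML;

// Parse the HTML; parser warnings are collected internally instead of printed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select text nodes that are direct children of <body>, skipping nodes
// that contain only whitespace (the indentation between elements).
$texts = [];
foreach ($xpath->query('/html/body/text()[normalize-space()]') as $node) {
    $texts[] = trim($node->nodeValue);
}

print_r($texts);
```

    Text inside the two p elements is not selected, because text() on the child axis only matches direct children of body. Iterating over the node list, instead of asking for a single string, is what collects every matching piece of text.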