PHP: Extract all text from an HTML page

I have been scratching my head over this for the past hour. Is there any reliable way to extract ONLY the text and nothing else (code, images, links, styles, scripts) from an HTML page? I am trying to extract all the text inside the body of an HTML document.

This includes paragraphs, plain text, and tabular data.

So far I have tried the Simple HTML DOM parser and also file_get_contents, but neither of them is working. Here is the code:


require_once "simple_html_dom.php";

function getplaintextintrofromhtml($html) {

    // Remove the HTML tags
    $html = strip_tags($html);

    // Convert HTML entities to single characters
    $html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    return $html;
}

$html = file_get_contents('');

echo getplaintextintrofromhtml($html);

Here is a screenshot of the output:

As you can see, it displays weird output and does not even show the whole page text.
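
One likely cause of the stray output (an assumption, since the screenshot is not reproduced here): strip_tags() removes the tags themselves but keeps the contents of <script> and <style> elements, so CSS and JavaScript source ends up in the result. A minimal sketch that drops those blocks first; the helper name stripNonTextBlocks and the sample markup are made up for illustration:

```php
<?php
// Hypothetical helper: remove <script> and <style> blocks (tags AND contents)
// before calling strip_tags(), which would otherwise leave the CSS/JS text behind.
function stripNonTextBlocks(string $html): string
{
    $cleaned = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html);
    // preg_replace() returns null on error; fall back to the input in that case.
    return $cleaned ?? $html;
}

// Made-up sample page fragment.
$sample = '<style>p{color:red}</style><p>Hello</p><script>alert(1)</script>';

echo strip_tags($sample), "\n";                            // CSS and JS text leak through
echo trim(strip_tags(stripNonTextBlocks($sample))), "\n";  // only the paragraph text remains
```

A regular expression is good enough for well-formed script and style blocks, but it is not an HTML parser; badly broken markup is better handled by a real DOM parser.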


  • I don't know why you think SimpleHTMLDOM doesn't work; you just have to use it properly. Target the body, then read its ->plaintext property (->innertext would give you the inner HTML, tags included):

    require_once 'simple_html_dom.php';

    function getplaintextintrofromhtml($url) {
        $html = file_get_html($url);
        // point to the body, then get its text without the tags
        $data = $html->find('body', 0)->plaintext;
        return $data;
    }

    echo getplaintextintrofromhtml('');
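
If a third-party parser is not an option, PHP's bundled DOM extension can do a similar job. The following is only a sketch under assumptions: the helper name extractBodyText and the sample markup are hypothetical, and the page is assumed to declare its own character set (loadHTML() may otherwise misread non-ASCII text).

```php
<?php
// Hypothetical helper: return the readable text inside <body> using DOMDocument.
function extractBodyText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop elements whose contents are code rather than readable text.
    foreach (['script', 'style', 'noscript'] as $tag) {
        // Copy the live node list to an array first, since removing nodes mutates it.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Collapse the runs of whitespace left behind by the removed markup.
    $text = preg_replace('/\s+/u', ' ', $body->textContent);
    return trim($text ?? '');
}

// Made-up sample document.
$sample = '<html><head><style>p{color:red}</style></head><body><p>Hello</p>'
        . '<script>alert(1)</script><table><tr><td>A</td><td>B</td></tr></table>'
        . '</body></html>';

echo extractBodyText($sample), "\n";
```

Note that textContent simply concatenates text nodes, so adjacent table cells can run together unless the markup has whitespace between them.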