php html web-scraping simple-html-dom

PHP: Extract all text from an HTML page


I have been scratching my head over this for the past hour. Is there any reliable way to extract ONLY the text, and nothing else (code, images, links, styles, scripts), from an HTML page? I am trying to extract all the text inside the body of an HTML document.

This includes paragraphs, plain text, and tabular data.

So far I have tried the Simple HTML DOM parser and also file_get_contents, but neither is working. Here is the code:

<?php

require_once "simple_html_dom.php";

function getplaintextintrofromhtml($html) {

    // Remove the HTML tags
    $html = strip_tags($html);

    // Convert HTML entities to single characters
    $html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    return $html;

}

$html = file_get_contents('http://www.thefreedictionary.com/contempt');

echo getplaintextintrofromhtml($html);
?>

Here is a screenshot of the output:

https://docs.google.com/file/d/0B-b63LoI1gSfaGhpR0NvdUtlbW8/edit?usp=drivesdk

As you can see, it displays weird output and does not even show the whole page text.


Solution

  • I don't know why you think SimpleHTMLDOM doesn't work; you just have to use it properly. Target the body, then read its ->plaintext property (->innertext would return the body's inner HTML, tags included):

    // load the library once, outside the function, so repeated calls don't redeclare it
    require_once 'simple_html_dom.php';

    function getplaintextintrofromhtml($url) {
        $html = file_get_html($url);
        // point to the body, then get its text without the tags
        $data = $html->find('body', 0)->plaintext;
        return $data;
    }
    
    echo getplaintextintrofromhtml('http://www.thefreedictionary.com/contempt');
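
For comparison, here is a minimal sketch that does not need SimpleHTMLDOM at all and relies only on PHP's bundled DOM extension. The likely cause of the "weird output" in the question is that strip_tags() removes only the tags themselves, so the JavaScript and CSS between <script> and <style> tags stay in the result. The sketch below skips those elements instead. The function name extractVisibleText is hypothetical, and the code assumes the page declares its own character set, since DOMDocument::loadHTML otherwise falls back to ISO-8859-1.

```php
<?php

// Hypothetical helper (not from the original answer): returns the readable
// text inside <body>, ignoring script, style and noscript contents.
function extractVisibleText(string $html): string
{
    // loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();

    // Real pages are rarely valid HTML; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every text node under <body> that is not inside code-like elements.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//body//text()[not(ancestor::script) and not(ancestor::style) and not(ancestor::noscript)]'
    );

    $parts = [];
    foreach ($nodes as $node) {
        // Collapse internal whitespace and drop nodes that are only whitespace.
        $text = trim(preg_replace('/\s+/u', ' ', $node->nodeValue));
        if ($text !== '') {
            $parts[] = $text;
        }
    }

    // Join with spaces so paragraphs and table cells do not run together.
    return implode(' ', $parts);
}

$sample = '<html><head><style>p{color:red}</style></head><body>'
        . '<p>Hello &amp; welcome</p><script>var x = 1;</script>'
        . '<table><tr><td>A</td><td>B</td></tr></table></body></html>';

echo extractVisibleText($sample), PHP_EOL;
```

To run it against the page from the question, you would pass in the result of file_get_contents('http://www.thefreedictionary.com/contempt'). One trade-off of joining text nodes with spaces is that a word split by inline markup (for example con<b>tempt</b>) comes out with a space in the middle.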