I have been scratching my head over this for the past hour. Is there any reliable way to extract ONLY the text
and nothing else (code, images, links, styles, scripts) from an HTML page? I am trying to extract all the text inside the body of an HTML document.
This includes paragraphs, plain text and tabular data.
So far I have tried the simplehtmldom parser and also file_get_contents,
but neither of them works. Here is the code:
<?php
require_once "simple_html_dom.php";
function getplaintextintrofromhtml($html) {
// Remove the HTML tags
$html = strip_tags($html);
// Convert HTML entities to single characters
$html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
return $html;
}
$html = file_get_contents('http://www.thefreedictionary.com/contempt');
echo getplaintextintrofromhtml($html);
?>
Here is a screenshot of the output:
https://docs.google.com/file/d/0B-b63LoI1gSfaGhpR0NvdUtlbW8/edit?usp=drivesdk
As you can see, it displays weird output and does not even show the whole page's text.
I don't know why you think SimpleHTMLDOM doesn't work; you just have to use it properly. Target the body, then read its ->plaintext
property (->innertext would return the body's inner HTML, tags included):
function getplaintextintrofromhtml($url) {
    // require_once, so calling this function twice does not redeclare the library
    require_once 'simple_html_dom.php';
    $html = file_get_html($url);
    // point to the body, then get its text without the markup
    $data = $html->find('body', 0)->plaintext;
    return $data;
}
echo getplaintextintrofromhtml('http://www.thefreedictionary.com/contempt');
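If you would rather not depend on an external library, roughly the same thing can be sketched with PHP's bundled DOM extension. This is only a minimal sketch under a few assumptions: extractBodyText() and the $sample markup are made-up names for illustration, the input is assumed to be UTF-8 or ASCII (loadHTML may guess a different encoding when the page declares none), and text from adjacent block elements can end up run together because nothing inserts separators between them.

```php
<?php
// Hypothetical helper (not part of any library): returns the readable text
// inside <body>, skipping script, style and noscript contents.
function extractBodyText(string $html): string
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Remove elements whose contents are code rather than text.
    // The node list is copied to an array first because the tree changes while looping.
    foreach (iterator_to_array($xpath->query('//script | //style | //noscript')) as $node) {
        $node->parentNode->removeChild($node);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    // textContent joins all descendant text nodes; collapse whitespace runs to one space.
    return trim(preg_replace('/\s+/u', ' ', $body->textContent));
}

// Made-up sample markup with a paragraph, a link, a table, a script and a style block.
$sample = '<html><head><style>p { color: red; }</style></head><body>'
        . '<p>Hello <a href="#">world</a></p><script>var x = 1;</script>'
        . '<table><tr><td>cell</td></tr></table></body></html>';
echo extractBodyText($sample), "\n";
```

For a live page you would pass the result of file_get_contents($url) to the helper instead of $sample. Unlike strip_tags, which only deletes the tags and leaves the JavaScript and CSS between them in the output, this drops those elements together with their contents.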