I have tried strip_tags, but it still leaves inline JS such as (function(){..}) and inline CSS such as #button{}.
I need to extract pure text from HTML, without any JS functions, styling, or tags, so that I can index it and use it for my search functionality.
html2text doesn't seem to solve the problem either!
PHP Code:
$url = "http://blog.everymansoftware.com/2011/11/development-setup-for-neo4j-and-php_05.html";
$fileHeaders = @get_headers($url);
if ($fileHeaders[0] == "HTTP/1.1 200 OK" || $fileHeaders[0] == "HTTP/1.0 200 OK") {
    $content = strip_tags(file_get_contents($url));
}
$content =
Everyman Software: Development Setup for Neo4j and PHP: Part 2
Development Setup for Neo4j and PHP: Part 2
This is Part 2 of a series on setting up a development environment for building projects using the graph database Neo4j and PHP. In Part 1 of this series, we set up unit test and development databases. In this part, we'll build a skeleton project that includes unit tests, and a minimalistic user interface.
All the files will live under a directory on our web server. In a real project, you'll probably want only the user interface files under the web server directory and your testing and library files somewhere more protected.
Also, I won't be using any specific PHP framework. The principles in t
This is a small snippet I always use to remove all the hidden text from a webpage, including everything in between <script>, <style>, <head>, etc. tags. It also collapses each run of whitespace down to a single whitespace character.
$url = "http://blog.everymansoftware.com/2011/11/development-setup-for-neo4j-and-php_05.html";
$fileHeaders = @get_headers($url);
// get_headers() returns false on failure, so check that before indexing into it
if ($fileHeaders !== false
    && ($fileHeaders[0] == "HTTP/1.1 200 OK" || $fileHeaders[0] == "HTTP/1.0 200 OK")) {
    $content = strip_html_tags(file_url_contents($url));
}
//To fetch the $url by using cURL
function file_url_contents($url){
    $crl = curl_init();
    $timeout = 30;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    return $ret;
} //file_url_contents ENDS
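//To remove all the hidden text not displayed on a webpage
function strip_html_tags($str){
    // Drop stray runs of three identical angle brackets (<<< or >>>)
    $str = preg_replace('/(<|>)\1{2}/is', '', $str);
    $str = preg_replace(
        array(// Remove invisible content: whole <head>, <style> and <script> blocks
            '@<head[^>]*?>.*?</head>@si',
            '@<style[^>]*?>.*?</style>@si',
            '@<script[^>]*?>.*?</script>@si',
        ),
        "", //replace above with nothing
        $str );
    $str = replaceWhitespace($str);
    $str = strip_tags($str);
    return $str;
} //function strip_html_tags ENDS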
//To collapse runs of whitespace into a single whitespace character
function replaceWhitespace($str) {
    $result = $str;
    foreach (array(
        "  ", " \t", " \r", " \n",
        "\t\t", "\t ", "\t\r", "\t\n",
        "\r\r", "\r ", "\r\t", "\r\n",
        "\n\n", "\n ", "\n\t", "\n\r",
    ) as $replacement) {
        // Replace each two-character pair with its first character
        $result = str_replace($replacement, $replacement[0], $result);
    }
    // Repeat until no pair is left
    return $str !== $result ? replaceWhitespace($result) : $result;
} //function replaceWhitespace ENDS
See it in action here http://codepad.viper-7.com/txIxfE
And output: http://pastebin.com/a86jd17s
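To try the idea without fetching a remote page, here is a minimal sketch that runs the same kind of stripping on an inline HTML string. The function name extract_visible_text and the sample markup are made up for illustration. It also swaps the recursive whitespace helper for a single preg_replace on \s+, which is an assumption about what is wanted (every run becomes one plain space), not the code above.

```php
<?php
// Hypothetical helper, not part of the snippet above.
function extract_visible_text(string $html): string
{
    // Remove whole <head>, <style> and <script> blocks, contents included.
    // A space is left behind so neighbouring words do not run together.
    $text = preg_replace(
        array(
            '@<head\b[^>]*>.*?</head>@si',
            '@<style\b[^>]*>.*?</style>@si',
            '@<script\b[^>]*>.*?</script>@si',
        ),
        ' ',
        $html
    );
    // Strip the remaining tags, then decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
    // Collapse every run of whitespace into one space and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample page with inline CSS and JS.
$html = '<html><head><title>Demo</title><style>#button{color:red}</style></head>'
      . '<body><h1>Hello</h1><script>(function(){ alert(1); })();</script>'
      . "<p>Plain   text\n here</p></body></html>";

echo extract_visible_text($html), "\n"; // prints "Hello Plain text here"
```

Regex-based stripping like this is fine for building a search index, but it can misfire on malformed markup, for example a script containing the literal string </script>. If that turns out to matter, a real HTML parser is the sturdier choice.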