Tags: php, indexing, full-text-search, search-engine

Strip HTML to remove all JS/CSS/HTML tags and get the actual text (as displayed in the browser) for indexing and search


I have tried strip_tags, but it still leaves inline JS such as (function(){..}) and inline CSS such as #button{}
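
The behaviour is easy to reproduce on a small string. The following is a minimal sketch; the sample markup is made up for illustration and is not taken from the page in question. strip_tags() removes the tags themselves but keeps the text that sat between them, so the bodies of <style> and <script> elements survive.

```php
<?php
// Hypothetical sample markup, used only to illustrate the problem.
$html = '<style>#button{color:red}</style><p>Hello</p><script>alert(1);</script>';

// strip_tags() drops the tags only; the text inside <style> and <script> remains.
$text = strip_tags($html);

echo $text, "\n"; // prints: #button{color:red}Helloalert(1);
```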

I need to extract pure text from HTML, without any JS functions, styling, or tags, so that I can index it and use it for my search functionality.

html2text doesn't seem to solve the problem either!

EDIT

PHP Code:

$url = "http://blog.everymansoftware.com/2011/11/development-setup-for-neo4j-and-php_05.html";

$fileHeaders = @get_headers($url);
if ($fileHeaders[0] == "HTTP/1.1 200 OK" || $fileHeaders[0] == "HTTP/1.0 200 OK")
{
    $content = strip_tags(file_get_contents($url));
}

Output:

$content =

(function() { var a=window,c="jstiming",d="tick";var e=function(b){this.t={};this.tick=function(b,o,f){f=void 0!=f?f:(new Date).getTime();this.t[b]=[f,o]};this[d]("start",null,b)},h=new e;a.jstiming={Timer:e,load:h};if(a.performance&&a.performance.timing){var i=a.performance.timing,j=a[c].load,k=i.navigationStart,l=i.responseStart;0=k&&(j[d]("_wtsrt",void 0,k),j[d]("wtsrt_","_wtsrt",l))}
try{var m=null;a.chrome&&a.chrome.csi&&(m=Math.floor(a.chrome.csi().pageT));null==m&&a.gtbExternal&&(m=a.gtbExternal.pageT());null==m&&a.external&&(m=a.external.pageT);m&&(a[c].pt=m)}catch(n){};a.tickAboveFold=function(b){var g=0;if(b.offsetParent){do g+=b.offsetTop;while(b=b.offsetParent)}b=g;750>=b&&a[c].load[d]("aft")};var p=!1;function q(){p||(p=!0,a[c].load[d]("firstScrollTime"))}a.addEventListener?a.addEventListener("scroll",q,!1):a.attachEvent("onscroll",q);
 })();











Everyman Software: Development Setup for Neo4j and PHP: Part 2



#navbar-iframe { display:block }






if(window.addEventListener) {
    window.addEventListener('load', prettyPrint, false);
  } else {
    window.attachEvent('onload', prettyPrint);
  }
var a=navigator,b="userAgent",c="indexOf",f="&m=1",g="(^|&)m=",h="?",i="?m=1";function j(){var d=window.location.href,e=d.split(h);switch(e.length){case 1:return d+i;case 2:return 0



2011-11-05







Development Setup for Neo4j and PHP: Part 2





This is Part 2 of a series on setting up a development environment for building projects using the graph database Neo4j and PHP. In Part 1 of this series, we set up unit test and development databases.  In this part, we'll build a skeleton project that includes unit tests, and a minimalistic user interface.

All the files will live under a directory on our web server. In a real project, you'll probably want only the user interface files under the web server directory and your testing and library files somewhere more protected.

Also, I won't be using any specific PHP framework.  The principles in t

Solution

  • This is a small snippet I always use to remove all the hidden text from a webpage, including everything between <script>, <style>, <head>, and similar tags. It also collapses every run of consecutive whitespace characters (spaces, tabs, CR, LF) into a single character: the first one in the run, which is not necessarily a space.

    <?php
    $url = "http://blog.everymansoftware.com/2011/11/development-setup-for-neo4j-and-php_05.html";
    $fileHeaders = @get_headers($url);
    if ($fileHeaders[0] == "HTTP/1.1 200 OK" || $fileHeaders[0] == "HTTP/1.0 200 OK")
    {
        $content = strip_html_tags(file_url_contents($url));
    }
    
    ############################################
    //To fetch the $url by using cURL
    function file_url_contents($url){
        $crl = curl_init();
        $timeout = 30;
        curl_setopt ($crl, CURLOPT_URL,$url);
        curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
        $ret = curl_exec($crl);
        curl_close($crl);
        return $ret;
    } //file_url_contents ENDS
    
    //To remove all the hidden text not displayed on a webpage
    function strip_html_tags($str){
        // Remove any run of three identical angle brackets ("<<<" or ">>>")
        $str = preg_replace('/(<|>)\1{2}/is', '', $str);
        $str = preg_replace(
            array(// Remove invisible content
                '@<head[^>]*?>.*?</head>@siu',
                '@<style[^>]*?>.*?</style>@siu',
                '@<script[^>]*?>.*?</script>@siu',
                '@<noscript[^>]*?>.*?</noscript>@siu',
                ),
            "", //replace above with nothing
            $str );
        $str = replaceWhitespace($str);
        $str = strip_tags($str);
        return $str;
    } //function strip_html_tags ENDS
    
    //To replace all types of whitespace with a single space
    function replaceWhitespace($str) {
        $result = $str;
        foreach (array(
        "  ", " \t",  " \r",  " \n",
        "\t\t", "\t ", "\t\r", "\t\n",
        "\r\r", "\r ", "\r\t", "\r\n",
        "\n\n", "\n ", "\n\t", "\n\r",
        ) as $replacement) {
        $result = str_replace($replacement, $replacement[0], $result);
        }
        return $str !== $result ? replaceWhitespace($result) : $result;
    }
    ############################
    ?>
    

    See it in action here: http://codepad.viper-7.com/txIxfE
    Output: http://pastebin.com/a86jd17s
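
    As an alternative to the regular expressions above, the same idea can be sketched with PHP's DOM extension: parse the markup, delete the elements whose content is never rendered, and read the remaining text. This is a different technique from the snippet above, not a rewrite of it. The helper name html_to_text() and the sample markup are hypothetical, and the sketch assumes the dom extension is enabled and the input is ASCII or declares its own charset.

```php
<?php
// Sketch of a DOM-based alternative; html_to_text() is a hypothetical helper name.
function html_to_text(string $html): string
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->documentElement;
    if ($root === null) {
        return '';
    }

    // Remove elements whose content is not displayed as page text.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//script | //style | //noscript | //head');
    foreach (iterator_to_array($nodes) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // textContent concatenates the remaining text nodes;
    // collapse each run of whitespace into one space.
    return trim(preg_replace('/\s+/u', ' ', $root->textContent));
}

// Made-up sample document for illustration.
$sample = '<html><head><title>T</title><style>#b{}</style></head>'
        . '<body><h1>Hello</h1> <script>var a = 1;</script><p>World</p></body></html>';

echo html_to_text($sample), "\n"; // prints: Hello World
```

    One caveat of this sketch: text from adjacent block elements is joined without a separator, so markup like <h1>Hello</h1><p>World</p> with no whitespace between the tags yields "HelloWorld". For indexing, it may be worth inserting a space after block-level elements before reading the text.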