
Strip out HTML and malicious code in PHP, leaving punctuation and foreign-language characters


function stripAlpha( $item )
{
    $search     = array( 
         '@<script[^>]*?>.*?</script>@si'   // Strip out javascript 
        ,'@<style[^>]*?>.*?</style>@siU'    // Strip style tags properly 
        ,'@<[\/\!]*?[^<>]*?>@si'            // Strip out HTML tags
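        ,'@<![\s\S]*?--[ \t\n\r]*>@'        // Strip multi-line comments including CDATA
        ,'/\s{2,}/'                         // Remove runs of two or more whitespace characters
        ,'/(\s){2,}/'                       // Same match as the previous pattern (redundant)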
    );
    $pattern    = array(
         '#[^a-zA-Z ]#'                     // Non alpha characters
        ,'/\s+/'                            // More than one whitespace
    );
    $replace    = array(
         ''
        ,' '
    );
    $item = preg_replace( $search, '', html_entity_decode( $item ) );
    $item = trim( preg_replace( $pattern, $replace, strip_tags( $item ) ) );

    return $item;
}

One person suggested replacing this entire script with a one-liner:

$clear = preg_replace('/[^A-Za-z0-9\-]/', '', urldecode($_GET['id']));

but that gives an error on the $_GET lookup: unknown variable "id".

What I'm looking for is the simplest script that removes all HTML code and weird characters, replaces carriage returns with spaces, and leaves punctuation such as periods, commas, and exclamation points intact.

There are a lot of similar questions, but none seem to really answer this one: those scripts strip away everything except plain letters, including sentence punctuation and non-English text such as Arabic or Spanish.

For example, if the string contains www.mygreatwebsite.com

the cleaner script returns wwwmygreatwebsitecom, which looks weird.

If someone is excited about something, as in 'Hey this is a great website! ', it also removes the exclamation point.

All the similar questions I've looked up remove all of these characters.

I'd like to keep the punctuation and any foreign-language characters, using one simple regex that clears out all the other stuff people paste into forms.

Naturally, carriage returns would be replaced by spaces.

Any suggestions?


Solution

  • Removing all HTML tags is easy with strip_tags:

    $text = strip_tags($html);
    

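    But this only gives clean text if the string doesn't contain CSS or JavaScript code: strip_tags removes the <style> and <script> tags themselves but leaves their contents in the output.

    So a better way to deal with this problem is to use DOMDocument and XPath to select all text nodes that don't have a style or script element as an ancestor: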
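
    A quick way to see that limitation for yourself is the minimal sketch below. The sample markup is made up for illustration; only strip_tags itself comes from the text above.

    ```php
    <?php
    // Hypothetical input: one paragraph plus inline CSS and JavaScript.
    $html = '<p>Hello, world!</p><style>p { color: red; }</style><script>alert(1);</script>';

    // strip_tags() drops the tags but keeps the text that sat between them,
    // so the CSS rule and the JavaScript call end up in the result.
    $text = strip_tags($html);

    echo $text, PHP_EOL; // prints: Hello, world!p { color: red; }alert(1);
    ```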

    $dom = new DOMDocument;
    $dom->loadHTML($html);
    
    $xp = new DOMXPath($dom);
    
    $textNodeList = $xp->query('//text()[not(ancestor::script) and not(ancestor::style)]');
    
    $text = '';
    
    foreach($textNodeList as $textNode) {
        $text .= ' '. $textNode->nodeValue;
    }
    

    To replace weird characters and whitespace (everything except punctuation, letters, and numbers) with a single space:

    $text = preg_replace('~[^\pP\pL\pN]+~u', ' ', $text);
    

    Here \pP is the character class for punctuation characters, \pL for letters, and \pN for numbers. (To be more precise about the characters you want to preserve, take a look at the available character classes; search for "Unicode character properties".)

    Finally, trim the text:
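
    To illustrate what "being more precise" could look like, here is a hedged sketch that compares the broad pattern with a narrower one. The sample string, the choice to keep only . , ! ? and the addition of \pM (combining marks, which some scripts such as Arabic use for diacritics) are assumptions for this example, not part of the pattern above.

    ```php
    <?php
    // Hypothetical form input with a line break, a URL and a few symbols.
    $text = "Hey, this is great!\r\nVisit www.mygreatwebsite.com <3 (100%)";

    // Broad: keep every Unicode punctuation character, letter and number.
    $broad = trim(preg_replace('~[^\pP\pL\pN]+~u', ' ', $text));

    // Narrow: keep letters, combining marks, numbers, and only . , ! ?
    $narrow = trim(preg_replace('~[^\pL\pM\pN.,!?]+~u', ' ', $text));

    echo $broad, PHP_EOL;  // Hey, this is great! Visit www.mygreatwebsite.com 3 (100%)
    echo $narrow, PHP_EOL; // Hey, this is great! Visit www.mygreatwebsite.com 3 100
    ```

    In both cases the line break and the "<" symbol collapse into a single space; the narrow version additionally drops the parentheses and the percent sign.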

    $text = trim($text);
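
Putting the steps above together, one possible wrapper could look like the sketch below. The function name cleanText, the UTF-8 input assumption, the XML encoding prefix (a common workaround to make loadHTML treat the input as UTF-8), and the libxml error suppression are additions for this example rather than part of the answer. It needs PHP's DOM extension.

```php
<?php
/**
 * Hypothetical helper: returns the visible text of an HTML fragment,
 * keeping Unicode punctuation, letters and numbers.
 */
function cleanText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    // Pasted form input is often malformed, so collect parser warnings
    // internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Assumption: the input is UTF-8. The prefix hints that to the parser.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every text node that is not inside a script or style element.
    $xp = new DOMXPath($dom);
    $nodes = $xp->query('//text()[not(ancestor::script) and not(ancestor::style)]');

    $text = '';
    foreach ($nodes as $node) {
        $text .= ' ' . $node->nodeValue;
    }

    // Collapse everything that is not punctuation, a letter or a number
    // (including line breaks) into a single space, then trim.
    return trim((string) preg_replace('~[^\pP\pL\pN]+~u', ' ', $text));
}

echo cleanText("<p>Hey, this is a great website!</p>\r\n<script>alert(1);</script><p>www.mygreatwebsite.com</p>"), PHP_EOL;
// prints: Hey, this is a great website! www.mygreatwebsite.com
```

Wrapping the steps in one function keeps the form-handling code down to a single call, which is close to the "one simple command" the question asks for.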