Strip tags in PHP with an allowed list but remove all attributes


In PHP, what is the fastest and simplest way to strip all HTML tags from a string except those in an allowed list, while also removing every attribute from the tags that are kept?

The built-in function strip_tags would do the job, but it keeps the attributes of the tags in the allowed list. I don't know whether regular expressions are the best approach, and I also don't know whether parsing the string would be too expensive.


Solution

  • A regular expression might fail if an attribute value contains a >.

    A safer way is to use DOMDocument, but note that the input should be valid HTML and that the output may be normalized by the parser (in the example below, the bare text 777 ends up wrapped in a <p>).

    <?php
    
    $htmlString = '<span>777</span><div class="hello">hello <b id="12">world</b></div>';
    // First pass: drop every tag that is not in the allowed list
    $stripped = strip_tags($htmlString, '<div><b>');

    $dom = new DOMDocument;              // init new DOMDocument
    $dom->loadHTML($stripped);           // load the HTML
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//@*');      // select every attribute in the document
    foreach ($nodes as $node) {
        // Remove the attribute from the element that owns it
        $node->parentNode->removeAttribute($node->nodeName);
    }

    // documentElement is <html>, its first child is <body>:
    // serialize only the children of <body>
    $cleanHtmlString = '';
    foreach ($dom->documentElement->firstChild->childNodes as $node) {
        $cleanHtmlString .= $dom->saveHTML($node);
    }

    echo $cleanHtmlString;
    

    Output:

    <p>777</p>
    <div>hello <b>world</b>
    </div>
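
The same steps can be wrapped in a reusable function. The following is only a sketch, not part of the original answer: the function name stripTagsAndAttributes and the wrapper <div> are assumptions made for illustration. Wrapping the stripped string in a container element before parsing is one way to keep the parser from adding a <p> around bare top-level text, and libxml_use_internal_errors keeps parser warnings about imperfect input from being printed.

```php
<?php

// Hypothetical helper (name chosen for illustration): keeps only the tags
// listed in $allowedTags and removes every attribute from them.
function stripTagsAndAttributes(string $html, string $allowedTags): string
{
    // First pass: drop every tag that is not in the allowed list.
    $stripped = strip_tags($html, $allowedTags);

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Assumption: a wrapper <div> gives the fragment a single known parent,
    // so top-level text is not wrapped in a <p> by the parser.
    $dom->loadHTML('<div>' . $stripped . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove every attribute from the element that owns it.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//@*') as $attr) {
        $attr->ownerElement->removeAttribute($attr->nodeName);
    }

    // The wrapper <div> is the first child of <body>;
    // serialize its children only, so the wrapper itself is not emitted.
    $wrapper = $dom->getElementsByTagName('body')->item(0)->firstChild;
    $clean = '';
    foreach ($wrapper->childNodes as $child) {
        $clean .= $dom->saveHTML($child);
    }

    return $clean;
}

$clean = stripTagsAndAttributes(
    '<span>777</span><div class="hello">hello <b id="12">world</b></div>',
    '<div><b>'
);
echo $clean;
```

As with the code above, the parser may still normalize whitespace or repair invalid markup, so the result is not guaranteed to match the input byte for byte.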