In PHP, what is the fastest and simplest way to strip all HTML tags from a string except those in an allowed list, while also removing every HTML attribute from the tags that are kept?
The built-in function strip_tags would do the job, but it keeps the attributes of the tags in the allowed list.
I don't know whether regular expressions are the best approach, and I also don't know whether parsing the string would be too expensive.
A regular expression might fail if an attribute value contains a > character.
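To make that failure concrete, here is a minimal sketch. The pattern is a hypothetical naive attempt written for illustration, not something taken from the question:

```php
<?php
// Hypothetical naive pattern: match an opening tag and keep only its name.
$naivePattern = '/<(\w+)[^>]*>/';

// The title attribute legitimately contains a ">" inside its quoted value.
$html = '<a title="a > b">link</a>';

// [^>]* stops at the first ">", which here sits inside the attribute value,
// so the rest of the attribute leaks into the output as text.
$result = preg_replace($naivePattern, '<$1>', $html);

echo $result, PHP_EOL; // prints: <a> b">link</a>
```

The tag is cut at the > inside the quoted value, so the tail of the attribute ends up as visible text. A pattern that tracks quotes can be written, but it quickly becomes harder to get right than using a parser.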
A safer way is to use DOMDocument, but note that the input should be valid HTML and that the output may be normalized by the parser (in the example below, the bare text 777 comes back wrapped in a <p> element).
<?php
$htmlString = '<span>777</span><div class="hello">hello <b id="12">world</b></div>';
$stripped = strip_tags($htmlString, '<div><b>');
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($stripped); // load the HTML
$xpath = new DOMXPath($dom);
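$nodes = $xpath->query('//@*'); // select every attribute node in the document
foreach ($nodes as $node) {
    $node->parentNode->removeAttribute($node->nodeName);
}
$cleanHtmlString = '';
// loadHTML wraps the fragment in <html><body>; serialize only the children of <body>
foreach ($dom->documentElement->firstChild->childNodes as $node) {
    $cleanHtmlString .= $dom->saveHTML($node);
}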
echo $cleanHtmlString;
Output:
<p>777</p>
<div>hello <b>world</b>
</div>
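If the extra <p> around bare text is unwanted, one possible variation is to wrap the stripped string in a container element before parsing and then serialize only that container's children. The sketch below is an assumption-laden illustration, not a definitive implementation. The function name stripTagsAndAttributes is made up for this example, the input is assumed to be plain ASCII (without a charset declaration, loadHTML treats the input as ISO-8859-1), and a stray </div> in the input would close the wrapper early.

```php
<?php
// Hypothetical helper (not part of PHP): keep only $allowedTags and drop every attribute.
function stripTagsAndAttributes(string $html, string $allowedTags): string
{
    $stripped = strip_tags($html, $allowedTags);

    // Collect parser warnings instead of emitting them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Assumption: wrapping in a <div> keeps top-level text from being moved into a <p>.
    $dom->loadHTML('<div>' . $stripped . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The XPath result is a snapshot, so attributes can be removed while looping.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//@*') as $attribute) {
        $attribute->ownerElement->removeAttribute($attribute->nodeName);
    }

    // The first <div> in document order is the wrapper added above.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $clean = '';
    foreach ($wrapper->childNodes as $child) {
        $clean .= $dom->saveHTML($child);
    }
    return $clean;
}

echo stripTagsAndAttributes(
    '<span>777</span><div class="hello">hello <b id="12">world</b></div>',
    '<div><b>'
);
```

The exact whitespace in the result depends on how libxml serializes block-level elements, so compare results with whitespace normalized rather than byte for byte.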