I'm writing a commenting system, similar to the one here on Stack Overflow, and I'm unsure of the best way to sanitize user content before outputting it.
I really want to sanitize the content when outputting it to the page, because I can think of all sorts of problems that could happen down the road if I sanitize it before inserting it into the database.
Up until now, I have always simply run my user content through htmlentities($content, ENT_QUOTES, 'UTF-8'), which from what I understand makes it safe to output.
However, the WYSIWYG editor I'm using for my commenting system allows the following HTML tags for formatting:
<code><span><div><label><a><br><p><b><i><del><strike><u><img><video><audio><iframe><object><embed><param><blockquote><mark><cite><small><ul><ol><li><hr><dl><dt><dd><sup><sub><big><pre><code><figure><figcaption><strong><em><table><tr><td><th><tbody><thead><tfoot><h1><h2><h3><h4><h5><h6>
So I need to be able to output those tags instead of encoding them in order for the comments to display correctly.
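To make the problem concrete, here is a minimal sketch of what htmlentities() does to formatted content. The sample comment string is made up for illustration.

```php
<?php
// Hypothetical comment as it might come out of the WYSIWYG editor.
$comment = '<b>Nice</b> post & "thanks"';

// Encode every HTML-special character, including both kinds of quotes.
$encoded = htmlentities($comment, ENT_QUOTES, 'UTF-8');

// The <b> tag is now shown as literal text rather than rendered as bold.
echo $encoded, PHP_EOL;
// prints: &lt;b&gt;Nice&lt;/b&gt; post &amp; &quot;thanks&quot;
```

This is safe to output, but all of the editor's formatting is lost, which is why encoding alone doesn't work here.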
The documentation for the WYSIWYG editor I'm using (Redactor) recommends running the user content through strip_tags(), passing the above tags as the allowed tags argument. However, questions and answers I've read on Stack Overflow have suggested this may not be sufficient.
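As I understand the concern, strip_tags() only filters tag names and leaves the attributes of allowed tags untouched. A minimal sketch, using a made-up malicious input and only `<a>` as the allowed tag:

```php
<?php
// Hypothetical hostile input: an allowed tag carrying dangerous attributes,
// plus a disallowed <script> element.
$input = '<a href="javascript:alert(1)" onclick="steal()">click</a><script>evil()</script>';

// Allow only <a>; everything else is stripped.
$output = strip_tags($input, '<a>');

// The <script> tags are removed (their text content remains), but the
// javascript: URL and the onclick handler on <a> pass through unchanged.
echo $output, PHP_EOL;
// prints: <a href="javascript:alert(1)" onclick="steal()">click</a>evil()
```

With tags such as `<img>`, `<iframe>`, and `<object>` on the allowed list, the same gap applies to their attributes as well.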
Operating under the assumption that strip_tags() isn't good enough, I've been looking into alternatives, and it seems one of the most well-regarded options is HTML Purifier. However, I keep reading questions and answers on here suggesting HTML Purifier is extremely slow.
Because of the way the comments will be rendered, each comment will have to be individually purified (I can't do all of them as one string), and I'm wondering if this will simply be too slow with HTML Purifier if there are dozens or even hundreds of comments in a thread.
Summary:
The trick is to store two copies of the user input: the original, unmodified version and the purified one (i.e., a cache). Purification then happens once, when a comment is saved, rather than on every page view. In fact, the HTML Purifier documentation comments on this and gives you some recipes for how to do it: http://htmlpurifier.org/docs/enduser-slow.html
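A minimal sketch of that idea follows. It is not one of the recipes from the HTML Purifier documentation: `CommentCache` is a hypothetical class, an in-memory array stands in for the database columns, and a strip_tags() closure stands in for the real purifier so the example runs without the library installed. With HTML Purifier available, the callable would be something like `[$purifier, 'purify']`.

```php
<?php
// Hypothetical helper, for illustration only.
final class CommentCache
{
    /** @var array<int, array{raw: string, purified: string}> */
    private array $rows = [];

    /** @var callable(string): string */
    private $purify;

    public function __construct(callable $purify)
    {
        $this->purify = $purify;
    }

    // Purify once, at write time, and keep the original alongside the result.
    public function save(int $id, string $rawHtml): void
    {
        $this->rows[$id] = [
            'raw'      => $rawHtml,
            'purified' => ($this->purify)($rawHtml),
        ];
    }

    // Rendering reads the cached copy; the purifier is not invoked here.
    public function render(int $id): string
    {
        return $this->rows[$id]['purified'];
    }

    // Re-purify from the stored originals, e.g. after changing the allowed tags.
    public function rebuild(): void
    {
        foreach ($this->rows as $id => $row) {
            $this->rows[$id]['purified'] = ($this->purify)($row['raw']);
        }
    }
}

// Stand-in purifier that counts its calls. NOT a safe sanitizer; it only
// takes the place of the real purifier for this demonstration.
$calls = 0;
$purify = function (string $html) use (&$calls): string {
    $calls++;
    return strip_tags($html, '<b><i>');
};

$cache = new CommentCache($purify);
$cache->save(1, '<b>hi</b><script>x</script>');

// Rendering the same comment repeatedly never re-runs the purifier.
echo $cache->render(1), PHP_EOL;
echo $cache->render(1), PHP_EOL;
echo $calls, PHP_EOL;
```

Keeping the original input means the purified copies can be regenerated whenever the allowed tag list or the purifier configuration changes, so nothing is lost by purifying at save time.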