I have a script that strips undesired HTML tags and escapes quotes to "improve" security, mainly to prevent script tag and onload injection and the like. It is used to "texturize" content retrieved from innerHTML.
However, it roughly triples my execution time (it runs in a loop). Is there a better way, or a better regex, to do this?
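function safe_content( text ) {
    // Drop script elements and paragraph tags.
    text = text.replace( /<script[^>]*>.*?<\/script>/gi, '' );
    text = text.replace( /(<p[^>]*>|<\/p>)/g, '' );
    // Turn quotes into the numeric entities for ’ (U+2019) and ” (U+201D).
    text = text.replace( /'/g, '&#8217;' ).replace( /'/g, '&#8217;' ).replace( /[\u2019]/g, '&#8217;' );
    text = text.replace( /"/g, '&#8221;' ).replace( /"/g, '&#8221;' ).replace( /"/g, '&#8221;' ).replace( /[\u201D]/g, '&#8221;' );
    // Restore plain quotes around attribute values: attr=&#8221;value&#8221; becomes attr="value".
    text = text.replace( /([\w]+)=&#[\d]+;(.+?)&#[\d]+;/g, '$1="$2"' );
    return text.trim();
};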
EDIT: here is a fiddle: https://fiddle.jshell.net/srnoe3s4/1/. The fiddle apparently doesn't like script tags inside a JavaScript string, so I left that case out.
I'll just deal with performance and the naive security checks, since writing a sanitizer is not something you can do on the corner of a table. To save time, avoid calling replace() several times when the replacement value is the same, which leads to this:
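function safe_content( text ) {
    // One pass for everything that is replaced by the empty string.
    text = text.replace( /<script[^>]*>.*?<\/script>|(<\/?p[^>]*>)/gi, '' );
    // One pass per entity: &#8217; is ’ (U+2019), &#8221; is ” (U+201D).
    text = text.replace( /'|'|[\u2019]/g, '&#8217;' );
    text = text.replace( /"|"|"|[\u201D]/g, '&#8221;' );
    // Restore plain quotes around attribute values.
    text = text.replace( /([\w]+)=&#[\d]+;(.+?)&#[\d]+;/g, '$1="$2"' );
    return text.trim();
};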
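The same idea can be pushed one step further for the two quote replacements: a single regex with a replacement callback handles both in one pass. This is only a sketch. The names escapeQuotes and QUOTE_ENTITIES are made up for illustration, and it assumes the quotes are meant to become the numeric entities &#8217; and &#8221; that the last replace() in safe_content() looks for. Whether it is actually faster than two separate calls should be measured on real input.

```javascript
// Hypothetical lookup table: each raw quote character maps to a numeric entity.
// The entity values are an assumption, chosen to match the /&#[\d]+;/ pattern
// used by the attribute-restoring replace() in safe_content().
const QUOTE_ENTITIES = {
    "'": "&#8217;",
    "\u2019": "&#8217;",
    '"': "&#8221;",
    "\u201D": "&#8221;"
};

// Replace straight and curly quotes in a single pass over the string.
function escapeQuotes(text) {
    return text.replace(/['"\u2019\u201D]/g, function (ch) {
        return QUOTE_ENTITIES[ch];
    });
}

console.log(escapeQuotes("it's \"ok\"")); // it&#8217;s &#8221;ok&#8221;
```

The character class only matches the four keys of the table, so the callback never returns undefined.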
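If you take into account dan1111's comment about weird input strings that will break the safe_content() implementation above (for example, input in which removing one match leaves a new match behind), you can wrap each replacement in while(/foo/.test(input)) to avoid the issue: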
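function safe_content( text ) {
    // Repeat each replacement until nothing matches any more,
    // in case removing one match exposes another.
    while ( /<script[^>]*>.*?<\/script>|(<\/?p[^>]*>)/gi.test( text ) )
        text = text.replace( /<script[^>]*>.*?<\/script>|(<\/?p[^>]*>)/gi, '' );
    // The replacement must not match its own pattern, otherwise the loop never ends:
    // &#8217; is ’ (U+2019), &#8221; is ” (U+201D).
    while ( /'|'|[\u2019]/g.test( text ) )
        text = text.replace( /'|'|[\u2019]/g, '&#8217;' );
    while ( /"|"|"|[\u201D]/g.test( text ) )
        text = text.replace( /"|"|"|[\u201D]/g, '&#8221;' );
    // Restore plain quotes around attribute values.
    while ( /([\w]+)=&#[\d]+;(.+?)&#[\d]+;/g.test( text ) )
        text = text.replace( /([\w]+)=&#[\d]+;(.+?)&#[\d]+;/g, '$1="$2"' );
    return text.trim();
};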
In standard test cases, this will not be much slower than the previous code. But if the input falls within the scope of dan1111's comment, it might be slower. See the perf demo.
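To make the difference concrete, here is a small sketch of the kind of input a single pass misses. The helper replaceUntilStable is hypothetical, not part of the code above. Instead of calling test() and then replace(), it compares the string before and after each pass, so the regex runs once per iteration and the loop also stops when a replacement leaves the text unchanged.

```javascript
// Hypothetical helper: re-apply one replacement until the text stops changing.
function replaceUntilStable(text, pattern, replacement) {
    let previous;
    do {
        previous = text;
        text = text.replace(pattern, replacement);
    } while (text !== previous);
    return text;
}

// Same tag pattern as in safe_content(), without the unused capture group.
const TAGS = /<script[^>]*>.*?<\/script>|<\/?p[^>]*>/gi;

// Removing the inner script element glues "<scr" and "ipt>" back together.
const nested = '<scr<script>x</script>ipt>alert(1)</script>';

const singlePass = nested.replace(TAGS, '');
const looped = replaceUntilStable(nested, TAGS, '');

console.log(singlePass); // <script>alert(1)</script>
console.log(looped === ''); // true
```

This only closes the one hole shown here. It is still a naive filter, not a real sanitizer.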