Filter out encoded JavaScript content from a request

I have a problem where I am trying to cleanse the request content to strip out HTML and javascript if included in the input parameters.

This is basically to protect against XSS attacks. The ideal mechanism would be to validate input and encode the output, but due to some restrictions I cannot work on the output end.

All I can do at this time is to try to cleanse the input through a filter. I am using ESAPI to canonicalize the input parameters and also using jsoup with the most restrictive Whitelist.none() option to strip all HTML.

This works as long as the malicious JavaScript is inside some HTML tags, but it fails for a URL that carries JavaScript code without any surrounding HTML, e.g.:

http://example.com/index.html?a=40&b=10&c='-prompt``-'

ends up showing an alert on the page. This is kind of what I am doing right now:

param = encoder.canonicalize(param, false, false);
param = Jsoup.clean(param, Whitelist.none());

So the question is:

Is there some way through which I can make sure that my input is stripped of all HTML and javascript code at the filter?
Should I add some regex validation? If so, is there any regex that will catch the cases that currently get past my check?

Solution

DISCLAIMER:

If output-escaping is not allowed in your internet-facing solution, you are in a NO-WIN SCENARIO. It's like antivirus on Windows: You'll be able to detect specific and known attacks, but you will be unable to detect or defend against unknown attacks. If your employer insists on this path, your due diligence is to make management aware of this fact and get their acceptance of the risks in writing. Every time I've confronted management with this, they've opted for the correct solution--output escaping.

================================================================

First off... watch out when using Jsoup in any kind of cleaning/filtering/input validation situation.

Upon receiving invalid HTML, like

<script>alert(1);

Jsoup will add in the missing </script> tag.

This means that if you're using Jsoup to "cleanse" HTML, it first transforms INVALID HTML into VALID HTML, before it begins processing.

So the question is: Is there some way through which I can make sure that my input is stripped of all HTML and javascript code at the filter? Should I throw in some regex validations but is there any regex that will take care of the cases that are getting past the check I have right now?

No. ESAPI's input validation is not appropriate for your use case, because HTML is not a regular language and ESAPI's validation rules are regular expressions. The fact is that you cannot do what you ask:

Is there some way through which I can make sure that my input is stripped of all HTML and javascript code at the filter?

and still have a functioning web application that requires user-defined HTML/JavaScript.
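To make the failure from the question concrete, here is a minimal sketch. It rests on two assumptions that are not in the original post: stripTags is a naive stand-in for an HTML stripper (it is not how Jsoup works internally), and renderInlineScript is a hypothetical page template that writes the parameter into a quoted JavaScript string. The payload contains no tags, so any tag-based cleaner has nothing to remove, yet the rendered script is still executable.

```java
// Hypothetical demo: why stripping tags cannot stop the payload from the question.
final class TagStripDemo {

    // Naive stand-in for an HTML stripper: removes anything that looks like a tag.
    // Assumption: this only illustrates the idea; it is not Jsoup's algorithm.
    static String stripTags(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    // Hypothetical template that places the parameter inside a JavaScript string literal.
    static String renderInlineScript(String param) {
        return "<script>var c = '" + param + "';</script>";
    }

    public static void main(String[] args) {
        String payload = "'-prompt``-'";
        String cleaned = stripTags(payload);

        // The payload has no tags, so the "cleaned" value is identical to the input.
        System.out.println(cleaned.equals(payload)); // prints true

        // The quotes in the payload close and reopen the string literal.
        System.out.println(renderInlineScript(cleaned));
        // prints <script>var c = ''-prompt``-'';</script>
    }
}
```

In the rendered script the expression is an empty string, minus a call to prompt, minus another empty string, so the browser runs the call even though no HTML was injected.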

You can stack the deck in your favor a little bit: I would choose something like OWASP's HTML Sanitizer, and test your implementation against the XSS inputs listed here.

Many of those inputs are taken from OWASP's XSS Filter evasion cheat sheet, and will at least exercise your application against known attempts. But you will never be secure without output escaping.

===================UPDATE FROM COMMENTS==================

So the use case is to try to block all HTML and JavaScript. My recommendation is to implement Caja, since it encapsulates HTML, CSS, and JavaScript.

JavaScript, though, is also difficult to manage through input validation because, like HTML, it is a non-regular language. Additionally, each browser has its own implementation that deviates in different ways from the ECMAScript spec. If you want to stop your input from being interpreted, you would ideally need a parser for each browser family that tries to interpret the user input in order to block it.

All of that, when all you really have to do is make sure that the output is escaped. Sorry to beat a dead horse, but I have to stress that output escaping is 100x more important than rejecting user input. You want both, but if forced to choose one or the other, output escaping is less work overall.
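For illustration, here is a minimal sketch of what context-specific output escaping can look like in plain Java. The class and method names (OutputEscaper, forHtml, forJavaScriptString) are hypothetical, and the escaping rules are a simplified assumption. In a real application you would more likely rely on a maintained encoder, such as the encodeForHTML and encodeForJavaScript methods of the ESAPI encoder you already use, than on hand-rolled code.

```java
// Hypothetical helper: escape untrusted data for the context it is written into.
final class OutputEscaper {

    // For HTML element content and quoted attribute values.
    static String forHtml(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            switch (ch) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(ch);
            }
        }
        return out.toString();
    }

    // For a quoted JavaScript string literal: every character that is not an
    // ASCII letter or digit becomes a backslash-u escape with four hex digits.
    static String forJavaScriptString(String s) {
        StringBuilder out = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            boolean safe = (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z')
                    || (ch >= '0' && ch <= '9');
            if (safe) {
                out.append(ch);
            } else {
                out.append(String.format("\\u%04x", (int) ch));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(forHtml("<script>alert(1);")); // prints &lt;script&gt;alert(1);
        System.out.println(forJavaScriptString("'-prompt``-'"));
    }
}
```

With the payload from the question, forJavaScriptString turns each quote into \u0027 and each backtick into \u0060, so the value stays inside the string literal as data instead of ending it. Which escaper applies depends on where the value lands in the page; the HTML escaper alone is not sufficient for a value written into a script.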