Tags: java, html-parsing

JavaScript parser in Java


I have a text box that accepts any text, including HTML and HTML with embedded JavaScript.

I need to validate this data in a server-side REST API implemented in Java. The goal of the validation is to avoid XSS vulnerabilities by not allowing any JavaScript to be saved in my database.

When the server-side API receives text from this text box, it should throw an error if the text is HTML with embedded JavaScript, but plain HTML should be accepted.

Example: input such as <svg onload=alert(document.cookie)/> should be rejected, but plain HTML such as <html><h1>this is test</h1></html> should be allowed.

I tried JSoup, which is an HTML parsing library, but I only need to verify whether JavaScript is present in the text, not check for HTML tags.

Can anyone suggest a way to do this?


Solution

  • Since you are already parsing your HTML with JSoup, the next step is to traverse every element and check whether it contains JavaScript. Something like this will check each element:

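    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // Returns true only if no element of the parsed document is flagged by detectJavascript.
    boolean validateHtml(String html) {
      Document doc = Jsoup.parse(html);
      for (Element e : doc.getAllElements()) {
          if (detectJavascript(e)) {
              return false;
          }
      }
      return true;
    }

    private boolean detectJavascript(Element e) {
      // Starting point only: rejects <script> elements. Add the remaining checks listed below.
      return "script".equals(e.normalName());
    }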
    

    Then, there are several checks you should perform inside the detectJavascript function:

    • Of course, reject script elements: "script".equals(e.normalName()). Note that comparing strings with == in Java compares references, not contents, so use equals.
    • Reject elements with a value in any on* attribute (onload, onclick, etc.). The full list of event-handler attributes is long, but it is probably enough to get all attributes with e.attributes() and reject the element if any attribute name starts with "on".
    • Every attribute that accepts a URL (href, src, etc.) can contain a "javascript:" value that executes JavaScript. You should check all of those too. For a complete (?) list of these attributes, check this other SO question.
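
    The two attribute checks above can be sketched as plain string tests on the name and value of each attribute, as you would get them by iterating over e.attributes(). This is only a sketch under assumptions: the class name JsAttributeChecks, its method names, and the list of URL-valued attributes are made up for illustration, and the list is deliberately not exhaustive. Whitespace and control characters are dropped and case is ignored before looking for the "javascript:" prefix, because browsers tolerate both in URL schemes.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of JSoup: pure string checks on one attribute.
final class JsAttributeChecks {

    // Illustrative and incomplete list of attributes that accept a URL (an assumption).
    private static final Set<String> URL_ATTRIBUTES = Set.of(
            "href", "src", "action", "formaction", "data",
            "poster", "background", "cite", "xlink:href");

    // True for onload, onclick, ... (any attribute name starting with "on").
    static boolean isEventHandler(String name) {
        return name.toLowerCase(Locale.ROOT).startsWith("on");
    }

    // True if a URL-valued attribute holds a "javascript:" URL.
    static boolean isJavascriptUrl(String name, String value) {
        if (!URL_ATTRIBUTES.contains(name.toLowerCase(Locale.ROOT))) {
            return false;
        }
        // Drop spaces and control characters, lower-case the rest.
        StringBuilder normalized = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (c > ' ') {
                normalized.append(Character.toLowerCase(c));
            }
        }
        return normalized.toString().startsWith("javascript:");
    }

    // Combined check to call once per attribute from detectJavascript.
    static boolean isDangerous(String name, String value) {
        return isEventHandler(name) || isJavascriptUrl(name, value);
    }
}
```

    Inside detectJavascript you could then loop over e.attributes() and return true as soon as isDangerous(attr.getKey(), attr.getValue()) holds for any attribute. Keeping the checks free of JSoup types makes them easy to unit-test on their own.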

    Finally, I advise against storing the original HTML in the database, even if it passes your validation. Instead, serialize the document parsed by JSoup back to HTML (for example with doc.html(), or doc.body().html() if you only want the body fragment) and store that. This way you make sure you store a well-formed document free of any "dangerous" elements.