Java to Upper Case Ignoring HTML Special Characters


How do I convert a string to upper case with String.toUpperCase() while ignoring HTML entities such as &nbsp; and all the others? The problem is that &nbsp; becomes &NBSP;, and the browser does not recognize that as a special HTML character.

I came up with this but it does not cover all special characters:

public static String toUpperCaseIgnoreHtmlSymbols(String str){
    if(str == null) return "";
    str = str.trim();
    // Upper-case first, then restore the entities that were upper-cased along the way.
    str = str.toUpperCase();
    str = str.replaceAll("(?i)&nbsp;","&nbsp;");
    str = str.replaceAll("(?i)&quot;","&quot;");
    str = str.replaceAll("(?i)&amp;","&amp;");
    //etc.
    return str;
}

Solution

  • Are you only interested in skipping HTML entities, or do you also want to skip tags? What about chunks of JavaScript? URLs in links?

    If you need to support that kind of stuff, you won't be able to avoid using a 'real' HTML parser instead of a regex. For example, parse the document using jsoup, manipulate the resulting Document, and convert it back to HTML:

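    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;

    // Parse the HTML, upper-case every text node, and serialize the document back to HTML.
    private String upperCase(String str) {
        Document document = Jsoup.parse(str);
        upperCase(document.body());
        return document.html();
    }

    // Recursively upper-case text nodes; tags and attributes are left untouched.
    private void upperCase(Node node) {
        if (node instanceof TextNode) {
            TextNode textnode = (TextNode) node;
            textnode.text(textnode.text().toUpperCase());
        }
        for (Node child : node.childNodes()) {
            upperCase(child);
        }
    }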
    

    Now, calling:

    upperCase("This is some <a href=\"http://arnout.engelen.eu\">text&nbsp;with&nbsp;entities</a>");
    

    will produce:

    <html>
      <head></head>
      <body>
        THIS IS SOME 
        <a href="http://arnout.engelen.eu">TEXT&nbsp;WITH&nbsp;ENTITIES</a>
      </body>
    </html>
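
If skipping entities really is all you need (no tags, scripts or URLs in the input), a regex over the JDK alone may be enough. The following is only a minimal sketch, not part of the answer above: the class name EntityAwareUpperCase, the method toUpperCaseKeepingEntities and the ENTITY pattern are made up for illustration. It assumes an entity looks like &name;, &#123; or &#x1F;, and it uses Locale.ROOT so the result does not depend on the default locale.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: upper-cases text but copies HTML character references verbatim.
class EntityAwareUpperCase {

    // Assumed entity shape: named (&nbsp;), decimal (&#160;) or hexadecimal (&#xA0;).
    private static final Pattern ENTITY =
            Pattern.compile("&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);");

    static String toUpperCaseKeepingEntities(String str) {
        if (str == null) return "";
        StringBuilder out = new StringBuilder(str.length());
        Matcher m = ENTITY.matcher(str);
        int last = 0;
        while (m.find()) {
            // Upper-case the plain text before the entity, then copy the entity unchanged.
            out.append(str.substring(last, m.start()).toUpperCase(Locale.ROOT));
            out.append(m.group());
            last = m.end();
        }
        // Upper-case whatever follows the last entity.
        out.append(str.substring(last).toUpperCase(Locale.ROOT));
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: TEXT&nbsp;WITH &quot;ENTITIES&quot; &amp; MORE
        System.out.println(toUpperCaseKeepingEntities("text&nbsp;with &quot;entities&quot; &amp; more"));
    }
}
```

Unlike a list of replaceAll calls, this does not need one line per entity, and it keeps case-sensitive names such as &eacute; versus &Eacute; as they were written. It will still upper-case tag names, attribute values and script content, so for real HTML the parser approach above remains the safer choice.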