java html escaping html-escape-characters

Is there a Java utility to verify if a string is a valid HTML escape character?


I want a method in the following format:

public boolean isValidHtmlEscapeCode(String string);

Usage would be:

isValidHtmlEscapeCode("A") == false
isValidHtmlEscapeCode("&#1513;") == true // Valid unicode character
isValidHtmlEscapeCode("&#x5E9;") == true // same as 1513 but in HEX
isValidHtmlEscapeCode("�") == false // Invalid unicode character

I wasn't able to find anything that does this. Is there an existing utility for it? If not, is there a smart way to implement it?


Solution

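  • import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // namedEntities is not defined in this snippet; it is assumed to be a
    // Set<String> of entity names (see the list linked below).

    // Compiled once. Groups: 1 = hex digits, 2 = decimal digits, 3 = entity name.
    private static final Pattern ENTITY = Pattern
            .compile("&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([0-9A-Za-z]+));");

    public static boolean isValidHtmlEscapeCode(String string) {
        if (string == null) {
            return false;
        }
        Matcher m = ENTITY.matcher(string);

        // matches() requires the whole string to be a single entity;
        // find() would also accept input such as "x&amp;y".
        if (m.matches()) {
            int codePoint = -1;
            String entity = null;
            try {
                if ((entity = m.group(1)) != null) {
                    // Hexadecimal reference: reject more than 6 digits
                    // (this also rejects heavily zero-padded values).
                    if (entity.length() > 6) {
                        return false;
                    }
                    codePoint = Integer.parseInt(entity, 16);
                } else if ((entity = m.group(2)) != null) {
                    // Decimal reference: reject more than 7 digits.
                    if (entity.length() > 7) {
                        return false;
                    }
                    codePoint = Integer.parseInt(entity, 10);
                } else if ((entity = m.group(3)) != null) {
                    // Named reference: valid only if the name is known.
                    return namedEntities.contains(entity);
                }
                // Valid if within the Unicode range and not a surrogate
                // (0xD800..0xDFFF).
                return 0x00 <= codePoint && codePoint < 0xd800
                        || 0xdfff < codePoint && codePoint <= 0x10FFFF;
            } catch (NumberFormatException e) {
                return false;
            }
        } else {
            return false;
        }
    }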
    

    The set of named entities (namedEntities in the code) is available here: http://pastebin.com/XzzMYDjF
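
For anyone who wants something that compiles on its own, a self-contained sketch of the same check might look like the following. It is not the answer's exact code. The class name HtmlEscapeCheck and the five-name NAMED_ENTITIES subset are placeholders for illustration (the real list is the one linked above), Set.of assumes Java 9 or later, and the code point test uses Character.isValidCodePoint plus the surrogate constants in place of the hand-written range comparison.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlEscapeCheck {
    // Hypothetical subset for illustration only; substitute the full list.
    private static final Set<String> NAMED_ENTITIES =
            Set.of("amp", "lt", "gt", "quot", "nbsp");

    // Groups: 1 = hex digits (max 6), 2 = decimal digits (max 7), 3 = name.
    private static final Pattern ENTITY = Pattern.compile(
            "&(?:#x([0-9a-fA-F]{1,6})|#([0-9]{1,7})|([0-9A-Za-z]+));");

    static boolean isValidHtmlEscapeCode(String s) {
        if (s == null) {
            return false;
        }
        Matcher m = ENTITY.matcher(s);
        // The whole string must be exactly one entity.
        if (!m.matches()) {
            return false;
        }
        if (m.group(3) != null) {
            return NAMED_ENTITIES.contains(m.group(3));
        }
        // The digit limits in the pattern keep both parses within int range.
        int codePoint = m.group(1) != null
                ? Integer.parseInt(m.group(1), 16)
                : Integer.parseInt(m.group(2), 10);
        // In the Unicode range (0..0x10FFFF) and not a surrogate.
        return Character.isValidCodePoint(codePoint)
                && !(codePoint >= Character.MIN_SURROGATE
                        && codePoint <= Character.MAX_SURROGATE);
    }

    public static void main(String[] args) {
        System.out.println(isValidHtmlEscapeCode("A"));          // false
        System.out.println(isValidHtmlEscapeCode("&#1513;"));    // true
        System.out.println(isValidHtmlEscapeCode("&#x5E9;"));    // true
        System.out.println(isValidHtmlEscapeCode("&#xD800;"));   // false
        System.out.println(isValidHtmlEscapeCode("&#1114112;")); // false
        System.out.println(isValidHtmlEscapeCode("&amp;"));      // true
    }
}
```

Putting the digit limits into the pattern itself means no NumberFormatException handling is needed, at the cost of rejecting zero-padded references longer than the limits, just as the length checks in the answer do.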