Replacing HTML 5 entities with their equivalent characters in Java


I'm trying to replace HTML 5 entities using StringEscapeUtils.unescapeHtml4(), but a lot of sequences are still left unreplaced, such as "&nbsp" and "&amp". What would you recommend using?


Solution

  • &nbsp and &amp aren't entities. &nbsp; and &amp; are entities. If your string is really missing the ; on them, that's why they're not being decoded.

    I just checked (just to be thorough!), and StringEscapeUtils.unescapeHtml4 does correctly decode &nbsp; and &amp;.

    The correct fix is to fix whatever's giving you that string with the incomplete entities in it.

    You could work around it by also turning &nbsp and &amp into \u00A0 and & using String#replace after calling StringEscapeUtils.unescapeHtml4:

    // Ugly, technically-incorrect workaround (but we do these things sometimes)
    String result =
        StringEscapeUtils.unescapeHtml4(sourceString)
        .replace("&nbsp", "\u00A0")
        .replace("&amp", "&");
    

    ...but it's not correct, because those aren't entities. Best to correct the string.
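
If the source of the string really can't be fixed, another possible workaround is to repair the two incomplete sequences before decoding rather than after. The sketch below uses only java.util.regex; the class name EntityRepair and the method addMissingSemicolons are hypothetical names made up for this illustration, and it only handles the two sequences mentioned in the question. It is still guesswork about what the broken input meant (for instance, "&ampersand" would become "&amp;ersand"), so treat it as a stopgap, not a fix.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library: appends the missing ';'
// to "&nbsp" and "&amp" so a normal entity decoder can handle them.
class EntityRepair {
    // Matches "&nbsp" or "&amp" only when NOT already followed by ';'.
    private static final Pattern MISSING_SEMICOLON =
            Pattern.compile("&(nbsp|amp)(?!;)");

    static String addMissingSemicolons(String s) {
        // "$1" re-inserts the captured name, then we add the ';'.
        return MISSING_SEMICOLON.matcher(s).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String broken = "Tom &amp Jerry&nbsp&amp; co";
        // Complete entities are left alone; incomplete ones get a ';'.
        System.out.println(addMissingSemicolons(broken));
        // prints: Tom &amp; Jerry&nbsp;&amp; co
    }
}
```

The repaired string can then be passed to StringEscapeUtils.unescapeHtml4 as usual. Repairing first means a correctly escaped literal such as "&amp;nbsp" is left untouched and decodes once to the text "&nbsp", whereas replacing after decoding would turn it into a non-breaking space.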