java date encoding utf-8 iso-8859-1

Unescape and convert string encoding


I have to parse a String to a Date object in Java. The string I get follows the pattern MMM d yyyy HH:mm:ss z, with the locale set to French.

The problem occurs when the date is in February, August, or December, because of the encoding of French accents. For example, I get d&#195;&#169;c. 15 2011 16:55:38 CET for December 15th 2011.

I can't change the way the string is created, so I have to deal with the bad encoding on my side. It seems that when the string is generated, it is badly encoded (UTF-8 content read as ISO 8859-1) and then escaped.

For now I use:

stringFromXML = stringFromXML.replaceAll("&#195;&#169;", "é");
stringFromXML = stringFromXML.replaceAll("&#195;&#187;", "û");

It works because the only accented characters in French month names are é and û, but is there a cleaner way to unescape and convert the characters?


Solution

  • You need two steps:

    1. Resolve numeric character references, for example using StringEscapeUtils (from Apache Commons Lang) as suggested by Andy:

      String unescaped = StringEscapeUtils.unescapeHtml(in);
      
    2. Fix the encoding by treating each resulting character as a UTF-8 code unit (one byte) and decoding the byte sequence as UTF-8:

      String out = new String(unescaped.getBytes("ISO-8859-1"), "UTF-8");
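The two steps can be put together as a rough sketch that uses only the JDK. The class name MojibakeDateFix and its methods are hypothetical, and the small regex-based unescaper is only a stand-in for StringEscapeUtils: it assumes the input contains decimal references such as &#195; and does not handle hexadecimal or named entities. The date parsing at the end assumes the JDK's French locale data abbreviates the month as "déc.", as in the question.

```java
import java.nio.charset.StandardCharsets;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
class MojibakeDateFix {

    // Matches decimal numeric character references such as &#195;
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    // Step 1: replace each decimal reference with the character it names.
    // Stand-in for StringEscapeUtils; hex and named entities are not handled.
    static String unescapeDecimalRefs(String in) {
        Matcher m = DECIMAL_REF.matcher(in);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group(); // leave out-of-range references untouched
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Step 2: map each char back to one byte via ISO-8859-1,
    // then decode the resulting byte sequence as UTF-8.
    static String reinterpretAsUtf8(String in) {
        return new String(in.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Both steps in order: unescape first, then fix the encoding.
    static String repair(String raw) {
        return reinterpretAsUtf8(unescapeDecimalRefs(raw));
    }

    // Parses the repaired string with the pattern from the question.
    static Date parseFrenchDate(String raw) throws ParseException {
        SimpleDateFormat format =
                new SimpleDateFormat("MMM d yyyy HH:mm:ss z", Locale.FRENCH);
        return format.parse(repair(raw));
    }
}
```

Calling MojibakeDateFix.repair on the example string from the question should give back the month with a proper é, which can then be passed to the date parser. Note that step 2 is only safe on input that really is double-encoded: a string that already contains correctly decoded accented characters would be damaged by it.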