Tags: java, encoding, utf-8, iso-8859-1

Java encoding - corrupted French characters


I have a system that receives French text from a third party, but I am having a hard time getting it into a readable form.

String frenchReceipt = "RETIR�E"; // The original Text should be "RETIRÉE"

I tried every combination I could think of to convert the string between UTF-8 and ISO-8859-1:

String frenchReceipt = "RETIR�E"; // The original Text should be "RETIRÉE"

byte[] b1 = new String(frenchReceipt.getBytes()).getBytes("UTF-8"); 
System.out.println(new String(b1));  // RETIR�E

byte[] b2 = new String(frenchReceipt.getBytes()).getBytes("ISO-8859-1"); 
System.out.println(new String(b2));  // RETIR�E

byte[] b3 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes(); 
System.out.println(new String(b3));  // RETIR?E 

byte[] b4 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes(); 
System.out.println(new String(b4));  //RETIR?E

byte[] b5 = new String(frenchReceipt.getBytes(), "ISO-8859-1").getBytes("UTF-8"); 
System.out.println(new String(b5));  //RETIR�E

byte[] b6 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes("ISO-8859-1"); 
System.out.println(new String(b6));  //RETIR?E

byte[] b7 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes("UTF-8"); 
System.out.println(new String(b7));  //RETIR�E

byte[] b8 = new String(frenchReceipt.getBytes(), "ISO-8859-1").getBytes("ISO-8859-1"); 
System.out.println(new String(b8));  //RETIR�E

As you can see, nothing fixes the problem.

Please advise.

Update: The third-party partner confirmed that the data is sent to my application in ISO-8859-1 encoding.


Solution

  • � is the Unicode replacement character (U+FFFD, EF|BF|BD in UTF-8). It is used to indicate that a system was unable to decode or render the correct symbol. The original character is already lost at that point, so there is no way to convert � back into É.

    frenchReceipt doesn't contain any byte sequence which could be converted into É because of the declaration:

    String frenchReceipt = "RETIR�E";
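
As a rough illustration of why the literal is already lossy, the sketch below encodes the string from the question and prints the resulting bytes. The class name `ReplacementCharDemo` and the `hex` helper are made up for this example, and it assumes the � in the literal really is U+FFFD.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // Formats bytes as upper-case hex separated by '|', e.g. "EF|BF|BD".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal from the question, with the replacement character (U+FFFD) written as an escape.
        String frenchReceipt = "RETIR\uFFFDE";

        // UTF-8 can encode U+FFFD; it becomes the three bytes EF|BF|BD.
        System.out.println(hex(frenchReceipt.getBytes(StandardCharsets.UTF_8)));
        // prints 52|45|54|49|52|EF|BF|BD|45

        // ISO-8859-1 has no such character, so Java substitutes '?' (3F).
        System.out.println(hex(frenchReceipt.getBytes(StandardCharsets.ISO_8859_1)));
        // prints 52|45|54|49|52|3F|45
    }
}
```

Neither byte sequence contains C9, the ISO-8859-1 code for É. The 3F in the second line is consistent with the RETIR?E outputs shown in the question.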
    

    The ISO-8859-1 idea in your snippet below is sound, but the conversion has to start from the correct byte source (the raw bytes as received), not from a String that already contains �.

    byte[] b2 = new String(frenchReceipt.getBytes()).getBytes("ISO-8859-1");
    System.out.println(new String(b2));
    

    So if you read "RETIRÉE" as bytes from a data source and get 52|45|54|49|52|C9|45 (which is what ISO-8859-1 should give), decoding those bytes as ISO-8859-1 yields the proper result. If the data source already contains the byte sequence EF|BF|BD, the only option you have is search and replace, but in that case there is no way to tell, e.g., ä from É, because both have been replaced by the same �.
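
A minimal sketch of that difference, assuming the raw bytes are available as a `byte[]` (the class and method names here are hypothetical):

```java
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Decodes raw bytes as ISO-8859-1; every byte maps to exactly one character.
    static String asLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Decodes raw bytes as UTF-8; malformed input is replaced with U+FFFD.
    static String asUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "RETIRÉE" encoded in ISO-8859-1: 52|45|54|49|52|C9|45
        byte[] retiree = {0x52, 0x45, 0x54, 0x49, 0x52, (byte) 0xC9, 0x45};

        // Correct charset: the byte C9 becomes É (U+00C9).
        System.out.println(asLatin1(retiree).equals("RETIR\u00C9E")); // true

        // Wrong charset: C9 is not valid UTF-8 here, so it turns into the replacement character.
        System.out.println(asUtf8(retiree).equals("RETIR\uFFFDE"));   // true

        // Once decoded wrongly, ä (E4) and É (C9) collapse into the same string.
        byte[] withUmlaut = {(byte) 0xE4, 0x45};
        byte[] withAcute = {(byte) 0xC9, 0x45};
        System.out.println(asUtf8(withUmlaut).equals(asUtf8(withAcute))); // true
    }
}
```

The comparisons print booleans rather than the decoded text so that the result does not depend on the console's own encoding.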

    Update: Since the data is delivered over TCP, reading it through

    new BufferedReader(new InputStreamReader(connection.getInputStream(),"ISO-8859-1"))
    

    solved the issue.
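
For reference, a self-contained sketch of the same reader setup. It uses a `ByteArrayInputStream` as a stand-in for `connection.getInputStream()`, and the class name `Latin1ReaderDemo` and the method `readFirstLine` are assumptions made for the example, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Latin1ReaderDemo {
    // Reads the first line from the stream, decoding the bytes as ISO-8859-1.
    static String readFirstLine(InputStream in) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulated wire data: "RETIRÉE" in ISO-8859-1 followed by a newline.
        byte[] wire = {0x52, 0x45, 0x54, 0x49, 0x52, (byte) 0xC9, 0x45, 0x0A};

        String line = readFirstLine(new ByteArrayInputStream(wire));

        // The byte C9 is decoded to É (U+00C9) instead of the replacement character.
        System.out.println(line.equals("RETIR\u00C9E")); // true
    }
}
```

Passing the charset explicitly to `InputStreamReader` keeps the decoding independent of the platform default charset, which is presumably what went wrong before the fix.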