
How to check the charset of string in Java?


In my application I'm getting the user info from LDAP and sometimes the full username comes in a wrong charset. For example:

ТеÑÑ61 ТеÑÑовиÑ61

The name can also arrive in English or in Russian and be displayed correctly. If the username changes, it's updated in the database. Even if I change the value in the db, it won't solve the problem.

I can fix it before saving by doing this:

new String(incorrect.getBytes("ISO-8859-1"), "UTF-8");

However, if I use it on a string that already contains correct Russian characters (e.g. "Тест61 Тестович61"), I get something like "????61 ????????61".

Can you please suggest something that can determine the charset of a string?


Solution

  • Strings in Java, AFAIK, do not retain their original encoding: they are always held as Unicode (UTF-16). What you want is to detect the charset of the original stream/bytes, which is why I think your String.getBytes() call comes too late.

    Ideally if you could get the input stream you are reading from, you can run it through something like this: http://code.google.com/p/juniversalchardet/

    There are plenty of other charset detectors out there as well.
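
If you can't get at the raw bytes and only have the String, a guarded version of the round trip from the question may be enough. This is only a sketch, and it assumes the damage is always "UTF-8 bytes decoded as ISO-8859-1". The class and method names (MojibakeRepair, repairIfMisdecoded) are made up for illustration. The idea: only attempt the repair when every character fits in ISO-8859-1 (correct Cyrillic does not, which is where the "????" comes from), and only accept the result when the recovered bytes are strictly valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {

    /**
     * Hypothetical helper: returns the repaired text if s looks like UTF-8
     * bytes that were decoded as ISO-8859-1, otherwise returns s unchanged.
     */
    public static String repairIfMisdecoded(String s) {
        // Correct Cyrillic cannot be encoded in ISO-8859-1; getBytes() would
        // silently replace each such character with '?', so leave it alone.
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(s)) {
            return s;
        }
        // Recover the bytes that were originally read (lossless for ISO-8859-1).
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);

        // Decode strictly: report malformed input instead of substituting U+FFFD.
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, e.g. genuine Latin-1 text; keep the original.
            return s;
        }
    }

    public static void main(String[] args) {
        // "\u0422\u0435\u0441\u0442" is the Cyrillic word from the question.
        String correct = "\u0422\u0435\u0441\u044261";
        // Simulate the bad LDAP value: UTF-8 bytes read as ISO-8859-1.
        String garbled = new String(
                correct.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        System.out.println(repairIfMisdecoded(garbled).equals(correct));
        System.out.println(repairIfMisdecoded(correct).equals(correct));
    }
}
```

Plain ASCII names pass through unchanged because ASCII is identical in both charsets. The check is a heuristic, not real detection: a genuine Latin-1 string whose bytes happen to form valid UTF-8 would be "repaired" by mistake, so detecting the charset on the original bytes is still the more reliable route.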