java character-encoding centos cp1252

How to get Java to use the correct character set?


Our servers run on CentOS, and our Java backend sometimes has to process a file that one of our clients originally generated on a Windows machine using CP-1252. In more than 95% of cases, however, we are processing UTF-8 files.

My question: if we know that certain files will always be UTF-8, and other files will always be CP-1252, is it possible to specify in Java the character set to use for reading in each file? If so:

  • Do we need to do anything at the systems-level for adding CP-1252 to CentOS? If so, what does this involve?
  • What Java objects would we use to apply the correct encoding on a per-file basis?

Thanks in advance!


Solution

    My question: if we know that certain files will always be UTF-8, and other files will always be CP-1252, is it possible to specify in Java the character set to use for reading in each file?

    Yes, assuming you're in charge of the code reading the file. Create a FileInputStream, then wrap it in an InputStreamReader, passing the relevant character encoding to the InputStreamReader constructor.
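
As a rough sketch of that wrapping, something like the following should work. The class name EncodedFileReader and the helper readAll are made up for illustration, and the charset name "windows-1252" is Java's usual name for CP-1252 (with "Cp1252" as an alias); it is assumed here to be available in your JRE.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class EncodedFileReader {

    // Reads the whole file at 'path', decoding its bytes with the given charset.
    static String readAll(String path, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // FileInputStream supplies raw bytes; InputStreamReader decodes them
        // with the charset passed in, rather than the platform default.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java EncodedFileReader <file> [utf8|cp1252]
        // Assumption: the caller already knows which encoding each file uses.
        Charset charset = (args.length > 1 && args[1].equals("cp1252"))
                ? Charset.forName("windows-1252")
                : StandardCharsets.UTF_8;
        System.out.print(readAll(args[0], charset));
    }
}
```

Passing the Charset explicitly for every file means the result does not depend on the server's default encoding, so the UTF-8 and CP-1252 files can be handled by the same code path.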

    Do we need to do anything at the systems-level for adding CP-1252 to CentOS? If so, what does this involve?

    That depends on what the JRE supports rather than on the operating system, since Java's charset implementations ship with the JRE. I've never used CentOS, so I don't know whether its JRE is likely to include the relevant encoding. You can use Charset.isSupported to check, though, and Charset.availableCharsets to list what's available.
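
A quick way to run that check on the CentOS box might look like this. The class name CharsetCheck and the helper hasCharset are invented for the example, and "windows-1252" is assumed to be the name you want to look up.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class for checking what the local JRE supports.
class CharsetCheck {

    // Thin wrapper around Charset.isSupported; the name must be a legal
    // charset name or an IllegalCharsetNameException is thrown.
    static boolean hasCharset(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        System.out.println("windows-1252 supported: " + hasCharset("windows-1252"));

        // availableCharsets() returns a sorted map keyed by canonical name.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
    }
}
```

If the first line prints false, the encoding is missing from that JRE installation, and no application code change will fix it.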