Tags: java, encoding, utf-8, character-encoding, utf-16

Java File parsing toolkit design, quick file encoding sanity check


(Disclaimer: I looked at a number of posts on here before asking; I found this one particularly helpful. I was just looking for a bit of a sanity check from you folks, if possible.)

Hi All,

I have an internal Java product that I built for processing data files and loading them into a database (AKA an ETL tool). It has pre-rolled stages for XSLT transformation and for things like pattern replacement within the original file. The input files can be of any format; they may be flat data files or XML data files, and you configure the stages you require for the particular data feed being loaded.

Up until now I have ignored the issue of file encoding (a mistake, I know) because everything was, in the main, working fine. However, I am now running into file encoding issues. To cut a long story short: because of the way stages can be configured together, I need to detect the encoding of the input file and create a Java Reader object with the appropriate arguments. I just wanted to do a quick sanity check with you folks before I dive into something I can't claim to fully comprehend:

  1. Adopt a standard file encoding of UTF-16 (I'm not ruling out loading double-byte characters in the future) for all files that are output from every stage within my toolkit
  2. Use JUniversalChardet or jchardet to sniff the input file encoding (see the sketch after this list)
  3. Use the Apache Commons IO library to create a standard reader and writer for all stages (am I right in thinking this doesn't have a similar encoding-sniffing API?)
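
For step 2, here is a minimal sketch of what the sniff-then-read flow could look like with juniversalchardet's UniversalDetector (the EncodingSniffer wrapper, its fallback parameter, and the buffer size are my own illustrative choices, not part of the library):

    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.Charset;

    import org.mozilla.universalchardet.UniversalDetector;

    public final class EncodingSniffer {

        // Sniffs the charset of a file with juniversalchardet, returning the
        // supplied fallback when detection fails or names an unsupported charset.
        public static Charset detect(File file, Charset fallback) throws IOException {
            UniversalDetector detector = new UniversalDetector(null);
            try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
                byte[] buf = new byte[4096];
                int read;
                while ((read = in.read(buf)) > 0 && !detector.isDone()) {
                    detector.handleData(buf, 0, read);
                }
            }
            detector.dataEnd();
            String name = detector.getDetectedCharset();
            detector.reset();
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
            return fallback;
        }

        // Opens a Reader over the file using the sniffed encoding.
        public static Reader openReader(File file, Charset fallback) throws IOException {
            return new InputStreamReader(new FileInputStream(file), detect(file, fallback));
        }
    }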

Do you see any pitfalls/have any extra wisdom to offer in my outlined approach?

Is there any way I can be confident of backwards compatibility with any data loaded using my existing approach of letting the Java runtime decide the encoding (windows-1252 on my platform)?
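
For example, I assume I could preserve the old behaviour by making the former implicit default explicit, using it as the sniffer's fallback (reusing the hypothetical EncodingSniffer sketched above; the file name is illustrative):

    Charset legacy = Charset.forName("windows-1252"); // the encoding my JVM was defaulting to
    Reader reader = EncodingSniffer.openReader(new File("legacy-feed.dat"), legacy);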

Thanks in advance,

-James


Solution

  • Option 1 strikes me as breaking backwards compatibility (certainly in the long run), although it is the "right way" to go (the right way generally does break backwards compatibility), with perhaps some additional thought about whether UTF-8 would be a better choice than UTF-16.

    Sniffing the encoding strikes me as reasonable if you have a limited, known set of encodings that you have tested, so that you know your sniffer correctly distinguishes and identifies them.

    Another option here is to use some form of meta-data (a file naming convention, if nothing more robust is an option) that lets your code know that the data was provided in UTF-16 and behave accordingly; otherwise, convert it to UTF-16 before moving forward.
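
    A minimal sketch of that convert-on-ingest step, assuming the source charset comes from your sniffer or from the naming convention (the class, method, and ".utf16" suffix are illustrative, not any kind of standard):

        import java.io.File;
        import java.io.FileInputStream;
        import java.io.FileOutputStream;
        import java.io.IOException;
        import java.io.InputStreamReader;
        import java.io.OutputStreamWriter;
        import java.io.Reader;
        import java.io.Writer;
        import java.nio.charset.Charset;
        import java.nio.charset.StandardCharsets;

        public final class Utf16Normalizer {

            // Hypothetical naming convention: files already in UTF-16 carry a
            // ".utf16" suffix and can skip conversion entirely.
            public static boolean isDeclaredUtf16(File f) {
                return f.getName().endsWith(".utf16");
            }

            // Transcodes src (read as srcCharset) into dest as UTF-16, the
            // toolkit's proposed standard encoding for inter-stage files.
            public static void transcodeToUtf16(File src, Charset srcCharset, File dest)
                    throws IOException {
                try (Reader in = new InputStreamReader(new FileInputStream(src), srcCharset);
                     Writer out = new OutputStreamWriter(new FileOutputStream(dest),
                                                         StandardCharsets.UTF_16)) {
                    char[] buf = new char[8192];
                    int n;
                    while ((n = in.read(buf)) > 0) {
                        out.write(buf, 0, n);
                    }
                }
            }
        }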