Tags: java, text, utf-8, utf-16, utf

Reading text from file with UTF-16 BOM character


I am trying to write a generic way to read text from a file. That is fairly easy, except for a requirement that leading BOM characters be discarded. For UTF-8 I got this working with a regex pattern:

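import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.regex.Pattern;
import org.apache.commons.io.IOUtils;

// Matches one or more BOM characters (U+FEFF) at the start of the text.
Pattern LEADING_BOM_PATTERN = Pattern.compile("^\uFEFF+");

Charset encoding; // This is given.
InputStream input; // This is created.

// Decode the stream, then remove the leading BOM characters.
String text = IOUtils.toString(input, encoding);
text = LEADING_BOM_PATTERN.matcher(text).replaceFirst("");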

Now my problem: this works perfectly for UTF-8 BOM characters (EF BB BF), but not for any of the other ones. However, as it states here:

The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format.

Which made me assume the "\uFEFF" character would work for all BOM characters. Turns out, it does not.
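
To make the quoted sentence concrete, here is a minimal sketch that encodes U+FEFF with a few standard charsets and prints the resulting bytes. The class and method names (`BomBytesDemo`, `bomHex`) are made up for this illustration and are not part of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
class BomBytesDemo {

    // Encodes the single character U+FEFF with the given charset and
    // returns the resulting bytes as space-separated upper-case hex.
    static String bomHex(Charset charset) {
        byte[] bytes = "\uFEFF".getBytes(charset);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:    " + bomHex(StandardCharsets.UTF_8));    // EF BB BF
        System.out.println("UTF-16BE: " + bomHex(StandardCharsets.UTF_16BE)); // FE FF
        System.out.println("UTF-16LE: " + bomHex(StandardCharsets.UTF_16LE)); // FF FE
    }
}
```

The character is the same in every case. It only comes back as U+FEFF, though, if the bytes are decoded with the charset that produced them.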

After some more research, it turned out that both the "FE FF" and "FF FE" BOMs are read by Java as char 65533 (U+FFFD, the replacement character), while the "\uFEFF" string resolves to char 65279 (U+FEFF). That explains why the characters are not removed, but I don't believe it is expected behavior.
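
The 65533 value can be reproduced with a small sketch, assuming the UTF-16 bytes were decoded as UTF-8 (the question does not say which charset was actually passed). The bytes FE and FF are never valid in UTF-8, so the decoder substitutes the replacement character. `BomDecodingDemo` and `firstChar` are hypothetical names used only here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
class BomDecodingDemo {

    // Decodes the bytes with the given charset and returns the
    // numeric value of the first char of the resulting string.
    static int firstChar(byte[] bytes, Charset charset) {
        return new String(bytes, charset).charAt(0);
    }

    public static void main(String[] args) {
        // A big-endian UTF-16 BOM followed by the letter "A".
        byte[] utf16be = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};

        // Wrong charset (assumed here to be UTF-8): malformed input
        // is replaced with U+FFFD.
        System.out.println(firstChar(utf16be, StandardCharsets.UTF_8));    // 65533

        // Matching charset: the BOM decodes to U+FEFF.
        System.out.println(firstChar(utf16be, StandardCharsets.UTF_16BE)); // 65279
    }
}
```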

Can anyone shed some light on why it does this, or rather how to fix it? Thanks :)


Solution

  • Turns out it was just a really stupid mistake. I didn't pass the right encoding to IOUtils, so it did not return the right characters. When passing the UTF-16 charset, it works fine. Silly me...
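
For reference, a minimal sketch of the same idea using only the JDK instead of IOUtils: decode with the file's actual charset, then drop any leading U+FEFF. The names `BomStripper`, `decodeWithoutBom` and `readWithoutBom` are made up for this example, and `InputStream.readAllBytes` assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class BomStripper {

    // Decodes the bytes with the given charset and removes any
    // leading U+FEFF characters from the decoded text.
    static String decodeWithoutBom(byte[] bytes, Charset encoding) {
        String text = new String(bytes, encoding);
        int start = 0;
        while (start < text.length() && text.charAt(start) == '\uFEFF') {
            start++;
        }
        return text.substring(start);
    }

    // Reads the whole stream (Java 9+) and strips leading BOM characters.
    static String readWithoutBom(InputStream input, Charset encoding) throws IOException {
        return decodeWithoutBom(input.readAllBytes(), encoding);
    }

    public static void main(String[] args) throws IOException {
        // A little-endian UTF-16 BOM followed by "hi".
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x68, 0x00, 0x69, 0x00};
        InputStream input = new ByteArrayInputStream(utf16le);
        System.out.println(readWithoutBom(input, StandardCharsets.UTF_16LE)); // hi
    }
}
```

Note that Java's plain UTF-16 charset already consumes a leading BOM while decoding, whereas UTF-8, UTF-16BE and UTF-16LE leave it in the text as U+FEFF, so the explicit stripping step mainly matters for those three.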