
Preventing a Unicode byte order mark from being written in the middle of a file


This code writes two strings to a file channel:

final byte[] title = "Title: ".getBytes("UTF-16");
final byte[] body = "This is a string.".getBytes("UTF-16");
ByteBuffer titlebuf = ByteBuffer.wrap(title);
ByteBuffer bodybuf = ByteBuffer.wrap(body);
FileChannel fc = FileChannel.open(p, READ, WRITE, TRUNCATE_EXISTING);
fc.position(title.length); // second string written first, but not relevant to the problem
while (bodybuf.hasRemaining()) fc.write(bodybuf);
fc.position(0);
while (titlebuf.hasRemaining()) fc.write(titlebuf);

Each string is prefixed by a BOM.

[Title: ?T]  *254 255* 0 84 0 105 0 116 0 108 0 101 0 58 0 32 *254 255* 0 84

While it is fine to have one at the beginning of the file, one in the middle of the stream creates a problem.

How can I prevent this from happening?


Solution

  • The BOM bytes are inserted because getBytes("UTF-16") encodes with a BOM:

    final byte[] title = "Title: ".getBytes("UTF-16");
    

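    Check title.length and you will find it contains 2 additional bytes for the BOM.

    So you could strip the BOM from each array before wrapping it in a ByteBuffer, or skip those two bytes when you write the ByteBuffer to the file.

    Alternatively, use an explicit-endianness charset, UTF-16LE or UTF-16BE, neither of which writes a BOM. Java's "UTF-16" encoder writes big-endian, so UTF-16BE is the one that matches a leading string encoded with "UTF-16"; whichever you pick, use the same byte order for every string in the file: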

    final byte[] title = "Title: ".getBytes("UTF-16LE"); 
    

    Or use UTF-8 if UTF-16 is not required; Java's UTF-8 encoder does not write a BOM:

    final byte[] title = "Title: ".getBytes("UTF-8");
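
Putting the options above together, here is a minimal sketch of how the question's write sequence might look with a single BOM at the start of the file. The class and method names (BomOnce, withBom, withoutBom, skipBom, write) are hypothetical and chosen only for illustration. It assumes the file should stay big-endian UTF-16 with one leading BOM, and it uses the StandardCharsets constants instead of charset names so that getBytes cannot throw UnsupportedEncodingException.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BomOnce {

    /** Big-endian UTF-16 with a leading BOM (FE FF); intended for the start of the file only. */
    public static ByteBuffer withBom(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_16));
    }

    /** Big-endian UTF-16 without a BOM; intended for everything after the start of the file. */
    public static ByteBuffer withoutBom(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_16BE));
    }

    /** Alternative: keep the "UTF-16" encoder and position the buffer past the BOM. */
    public static ByteBuffer skipBom(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_16);
        boolean hasBom = bytes.length >= 2
                && bytes[0] == (byte) 0xFE && bytes[1] == (byte) 0xFF;
        // wrap(array, offset, length) sets position to offset, so a write starts after the BOM.
        return hasBom ? ByteBuffer.wrap(bytes, 2, bytes.length - 2) : ByteBuffer.wrap(bytes);
    }

    /** Writes the body first and the title second, in the same order as the question. */
    public static void write(Path p, String title, String body) throws IOException {
        ByteBuffer titleBuf = withBom(title);
        ByteBuffer bodyBuf = withoutBom(body);
        try (FileChannel fc = FileChannel.open(p, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            fc.position(titleBuf.remaining()); // leave room for the BOM plus the title
            while (bodyBuf.hasRemaining()) fc.write(bodyBuf);
            fc.position(0);
            while (titleBuf.hasRemaining()) fc.write(titleBuf);
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("bom-demo", ".txt");
        write(p, "Title: ", "This is a string.");
        // Decoding with UTF_16 consumes the single leading BOM.
        System.out.println(new String(Files.readAllBytes(p), StandardCharsets.UTF_16));
        Files.delete(p);
    }
}
```

In this sketch, withoutBom and skipBom should produce the same bytes for a non-empty string, so either can be used for the body. The try-with-resources block also closes the channel, which the snippet in the question leaves open.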