Tags: java, unicode, utf-8, notepad++, byte-order-mark

Writing UTF-8 without BOM


This code,

OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
out.write("A".getBytes());

And this,

OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
out.write("A".getBytes(StandardCharsets.UTF_8));

produce the same result (in my opinion), which is UTF-8 without a BOM. However, Notepad++ is not showing any information about the encoding. I expected it to report "Encode in UTF-8 without BOM", but no encoding is selected in the "Encoding" menu.

Now, this code writes the file as UTF-8 with a BOM:

OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
byte[] bom = { (byte) 239, (byte) 187, (byte) 191 }; // UTF-8 BOM: 0xEF 0xBB 0xBF
out.write(bom);
out.write("A".getBytes());

Notepad++ then displays the encoding as Encode in UTF-8.

Question: What is wrong with the first two snippets, which are supposed to write the file in UTF-8 without a BOM? Is my Java code doing the right thing? If so, is the problem with how Notepad++ detects the encoding?

Is Notepad++ just guessing?


Solution

  • "A" written using UTF-8 without a BOM produces exactly the same file as "A" written using ASCII or ISO-8859-* or any other ASCII-compatible encodings. That file contains a single byte with the decimal value 65.

    Think of it this way:

    • "A".getBytes("UTF-8") returns a new byte[] { 65 }
    • "A".getBytes("ISO-8859-1") returns a new byte[] { 65 }
    • You write the results of those calls into a file
    • How is the consumer of the file supposed to distinguish the two?

    There's nothing in that file that suggests that UTF-8 needs to be used to decode it.
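
    As a quick check, you can compare the byte arrays directly. This is a minimal sketch (the class name SameBytesDemo is purely illustrative):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class SameBytesDemo {
        public static void main(String[] args) {
            byte[] utf8 = "A".getBytes(StandardCharsets.UTF_8);
            byte[] latin1 = "A".getBytes(StandardCharsets.ISO_8859_1);

            System.out.println(Arrays.toString(utf8));       // [65]
            System.out.println(Arrays.toString(latin1));     // [65]
            System.out.println(Arrays.equals(utf8, latin1)); // true: the files are byte-for-byte identical
        }
    }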

    Try writing "Käsekuchen" or something else that's not encodable in ASCII and see if Notepad++ guesses the encoding correctly (because that's exactly what it does: it makes an educated guess, there's no metadata that tells it which encoding to use).